Abstract


This work addresses the use of commercial off-the-shelf (COTS) rotor-based unmanned aerial
vehicles (UAVs) to support emergency forces in the rapid structural assessment of a disaster
site by means of aerial image-based reconnaissance. It proposes a two-part framework that
relies on the integrated stereo vision sensor and the visual payload camera of the UAV to
execute three high-level applications that support first responders in disaster relief
missions. The first part aims at (i) increasing the autonomous and safe use of the UAV by
relying on the image data from the integrated stereo vision sensor to perform real-time
obstacle detection and collision avoidance. For this purpose, this work presents an approach
for real-time and low-latency disparity estimation, which is optimized for execution on an
on-board embedded device and computes disparity maps that depict the environment in the
flight path of the UAV. These disparity maps, in turn, serve as input to a subsequent
algorithm for real-time obstacle detection and reactive collision avoidance. The second part
of the presented framework addresses (ii) fast 3D mapping and 3D change detection of the
disaster site, based on image data captured by the visual payload camera of the UAV. For
this, an approach for fast and dense multi-view stereo depth estimation from multiple aerial
images captured by a single moving camera is presented. In addition, this work presents an
approach for single-view depth estimation from a single aerial image. While the former
approach is based on conventional methods for dense image matching, the latter is a
data-driven approach relying on deep learning. Estimating depth from a single input image
also makes it possible to consider scenes that contain dynamic objects, such as moving cars.
An important assumption in the second part of the framework is that, during the flight of
the UAV, the image data from the main camera is transmitted to a ground control station,
which is equipped with more hardware resources. However, details of the data link are not
covered by this work.


For the two-view disparity estimation from image data of the stereo vision sensor, a
well-known and proven approach from the literature is optimized for real-time processing on
an embedded system equipped with a CPU and a GPU. Embedded systems with a built-in GPU are
increasingly deployed on COTS UAVs, since they are more versatile than FPGA-based systems
and better suited for running artificial intelligence (AI) applications. For the task of
fast 3D mapping, an approach is presented that uses a hierarchical processing pipeline to
compute dense depth maps from three or more images. The hierarchical processing scheme,
together with the improved method for surface-aware regularization, makes it possible to
efficiently and accurately reconstruct slanted surface structures, which are particularly
prominent in oblique aerial imagery. The third approach presented in this work uses deep
learning to estimate a depth map from a single aerial image. It is trained in a
self-supervised manner, which does not require any special training data and thereby
increases its flexibility in practical use.


In this work, the three methodological approaches for two-view, multi-view and single-view
depth estimation are presented in detail and thoroughly evaluated. Their performance with
respect to the considered use-cases is extensively discussed and demonstrated. This work
shows (i) that the presented approach for real-time two-view disparity estimation on
embedded devices achieves state-of-the-art results and is, at the same time, well-suited for
real-time and low-latency obstacle detection and reactive collision avoidance on board the
UAV. (ii) Furthermore, it is shown that the methodological extensions within the approach
for multi-view depth estimation allow for both fast and accurate reconstruction of the
depicted scene. (iii) The experiments conducted on the approach for single-view depth
estimation by means of deep learning show that, even though it is not yet suited for
stand-alone use, it is well-suited for complementary use alongside approaches that perform
depth estimation by means of conventional dense image matching. (iv) In conclusion, both the
disparity maps computed from the stereo image pair and the multi-view depth maps are
well-suited for the considered use-cases, namely real-time obstacle detection and collision
avoidance as well as 3D mapping and 3D change detection, respectively.
This work addresses the use of commercial off-the-shelf, rotor-based UAVs to support civil
emergency forces in assessing the situation and gaining an overview at the site of an
operation. First responders in particular can be supported in their work by a rapid
assessment of the situation on site with the help of aerial images captured by commercial
off-the-shelf UAVs. To this end, this work presents a two-part framework. The first part of
the framework aims at (i) increasing the autonomy and safety of the UAV through an approach
for automatic obstacle detection and collision avoidance based on image data from a stereo
camera. For this purpose, this work presents a method for disparity estimation from stereo
image pairs that can be executed in real time on an embedded system on board the UAV. It
uses the image data of the stereo camera built into the UAV to compute disparity maps that
depict the environment in the flight direction of the UAV and serve as input to a subsequent
algorithm for obstacle detection and reactive collision avoidance. The second part of the
presented framework addresses (ii) fast, near-real-time 3D mapping of the operation area as
well as fast 3D change detection, based on the image data captured by the main camera of the
UAV. For this, a method for fast and dense depth estimation from multiple aerial images,
captured by a camera moving around the scene, is presented. In addition, a method for depth
estimation from a single aerial image is presented. While the former is based on
conventional methods of image matching, the latter follows a data-driven, learning-based
approach. Estimating depth data from only one image makes it possible to consider scenes
that contain dynamic objects, such as moving cars. An important assumption in the second
part of the framework is that the image data of the main camera is transmitted during the
flight to a ground control station that is equipped with more hardware resources. Details of
the data link, however, are not the subject of this work.


For disparity estimation from stereo image pairs, a well-known and proven method from the
literature is optimized for real-time processing on an embedded system equipped with a CPU
and a GPU. Embedded systems with a built-in GPU are increasingly installed on commercial
off-the-shelf drones, since they are more versatile than FPGA-based systems and better
suited for running AI methods. For fast 3D mapping, a method is presented that uses a
hierarchical processing chain to compute depth maps from three or more images. The
hierarchical approach, together with an improved regularization that also takes surface
structures into account, allows an efficient and accurate reconstruction of slanted
surfaces, which are particularly dominant in oblique aerial images. The third method
presented in this work uses a learning-based approach to estimate a depth map from a single
aerial image. It is trained in a self-supervised manner, which requires no special training
data and thus increases the flexibility of the method in practical use.


The three methods for depth estimation presented in this work, covering the two-view,
multi-view and single-view cases, are described in detail and examined with regard to their
strengths and weaknesses on the basis of extensive investigations. In particular, their
performance in the context of the considered use-cases is demonstrated and worked out. The
results show (i) that the presented method for real-time disparity estimation from stereo
image pairs on embedded systems achieves state-of-the-art results and is, at the same time,
very well suited for real-time obstacle detection and reactive collision avoidance on board
the UAV. (ii) Furthermore, it is shown that the methodological extensions within the method
for multi-view depth estimation allow both a fast and an accurate capture of the scene
geometry. (iii) The investigations into depth estimation from a single aerial image by means
of a learning-based method show that such a method is not yet suited for stand-alone
operation, but is well suited for complementary use alongside methods based on conventional
image matching, since the strengths of both modalities can then be combined. (iv) Finally,
it is shown that both the disparity maps computed from the stereo image pair and the depth
maps of the multi-view depth estimation method are well suited for the considered use-cases,
namely obstacle detection and collision avoidance, and 3D mapping and 3D change detection,
respectively.
Terminology


This chapter includes material from the following publication: 


Ruf, B.; Weinmann, M., and Hinz, S. (2021b): “FaSS-MVS - Fast multi-view stereo with surface-aware semi- 
global matching from UAV-borne monocular imagery”. In: arXiv preprint arXiv:2112.00821v1. Reprinted with 
permission. It is cited as (Ruf et al. 2021b) and marked with a green sidebar. 


This chapter introduces important terms used in this work in order to resolve ambiguous
definitions and interpretations.


Commercial Off-the-Shelf Unmanned Aerial Vehicles 


Remotely controlled aircraft can be grouped into a number of different categories, such as
fixed-wing or rotor-based aircraft, high-altitude long-endurance or medium-altitude
long-endurance aircraft, and those that can be bought off-the-shelf by anyone or those that
are uniquely manufactured and flown by experts. This work focuses on the use of commercial
off-the-shelf (COTS) unmanned aerial vehicles (UAVs), meaning that the UAVs can be bought
off-the-shelf and flown by anyone, without the need for special training. Only an
appropriate license might be required, depending on the responsible aviation agency, e.g.
the European Union Aviation Safety Agency (EASA). While this definition is not restricted to
a single type of UAV, rotor-based UAVs are the most commonly available and the easiest to
use. Thus, in the scope of this work, unless explicitly specified otherwise, the term UAV
stands for rotor-based remotely controlled aircraft. Lastly, this work deliberately refrains
from categorizing the considered UAVs by their cost, since the category “low-cost” is
defined differently depending on the application or use-case.


Dense Image Matching and Depth Estimation 


From two or more images that depict the same scene from different vantage points, the 3D 
scene structure and, in turn, the scene depth with respect to a selected reference view can be 
calculated. Here, the scene depth, perceived from a certain viewpoint with defined charac- 
teristics, e.g. field-of-view (FOV) and image resolution, is typically stored inside a 2D depth 
map, holding the pixel-wise orthogonal distance of the depicted object from the optical center 
of the camera. The process of calculating these depth maps is denoted as depth estimation. 
When performing depth estimation from two or more images, a dense correspondence field
between the pixels of each considered image needs to be established. This is usually done by 
dense image matching (DIM), which, in contrast to the matching of point features, uses direct 
pixel matching to find a correspondence for all pixels in the image. 
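
For the special case of a rectified stereo pair, the link between a dense correspondence field and a depth map can be illustrated with a minimal sketch. It assumes the standard pinhole relation depth = camera constant × baseline / disparity; the function name, the parameter names and the numeric values are hypothetical and are not taken from this work.

```python
import numpy as np


def disparity_to_depth(disparity, baseline_m, camera_constant_px, min_disparity=1e-6):
    """Convert a disparity map (pixels) of a rectified stereo pair into a depth map (metres).

    Assumes the standard relation depth = camera_constant * baseline / disparity.
    Pixels without a valid (positive) disparity are marked as NaN.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disparity
    depth[valid] = camera_constant_px * baseline_m / disparity[valid]
    return depth


# Hypothetical values: 10 cm baseline, camera constant of 400 px.
disparity_map = np.array([[8.0, 4.0], [0.0, 16.0]])
depth_map = disparity_to_depth(disparity_map, baseline_m=0.1, camera_constant_px=400.0)
```

Larger disparities map to smaller depths, which is why close obstacles are resolved more finely than distant ones for a fixed disparity resolution.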


Monocular and Single-Image 


In the literature, the two terms “monocular” and “single-image” are often mixed and not used
in a consistent manner. In the scope of this work, the term monocular is used when referring
to the use of only one camera. For example, this is the case for monocular multi-view depth
estimation, where the scene depth is calculated based on two or more images captured by one
camera from different viewpoints. Evidently, one camera can only capture images from
different viewpoints at different moments in time. In contrast, the term single-image or
single-view is used when truly only one image is considered for depth estimation or other
tasks.


On-Board and Off-Board Processing 


On-board processing refers to the use of specialized and embedded hardware that allows
algorithms to be executed directly on board the sensor platform. The use of the term
“on-board” does not imply any details on the run-time of the algorithm. On-board processing
can be done with or without time constraints. To denote the execution of algorithms on
hardware that is not directly connected to the sensor platform or the UAV, the term
off-board is used. This, for example, applies to the execution of algorithms on a ground
control station (GCS).


Real-Time, Online and Offline Processing 


The run-time requirement for an algorithm, and whether it is denoted as real-time or not,
greatly depends on the rate of the input data and the actual application in which the
algorithm is used. Thus, in the literature, the use of the term “real-time” is not
consistent and does not always mean the same thing. In the scope of this work, real-time
processing denotes fast and low-latency processing. This means that the processing rate of
the algorithm is high enough to allow an immediate reaction based on the calculated results.
For example, in the case of reactive collision avoidance based on depth data from a stereo
camera, the calculation of the scene depth needs to be fast enough to still be able to
initiate an appropriate maneuver to avoid an imminent collision. Real-time processing is
deliberately not defined by a minimum processing rate, since the available reaction time
depends on various factors, such as flight speed. In contrast, online processing denotes
fast processing without hard time constraints. Ideally, an algorithm for online processing
should be able to keep up with the frame rate of the input data. At the same time, however,
it is not critical if the algorithm has a high latency and if the results are only available
a couple of frames after the input of the corresponding reference frame. This, for example,
is the case when performing depth estimation from two or more images that are captured by a
single moving camera, i.e. monocular multi-view stereo. Here, enough input images with an
appropriate baseline first have to be collected before the actual processing can be started.
Both real-time and online processing refer to a computation during the acquisition or
reception of the input data. Thus, corresponding algorithms only have a confined set of
input data available during execution. Offline processing, on the other hand, is done fully
disconnected from the actual acquisition of the input data and, thus, it is assumed that
algorithms which are executed offline have access to all available input data. (Ruf et al.
2021b, Sec. 1.)
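
The latency of online monocular multi-view stereo, where images with an appropriate baseline first have to be collected, can be illustrated with a minimal sketch. The selection rule used here (a fixed minimum distance between consecutive camera centers) as well as the names `collect_input_bundle`, `min_baseline_m` and `bundle_size` are assumptions made for illustration and do not describe the frame selection of this work.

```python
import numpy as np


def collect_input_bundle(camera_centers, min_baseline_m, bundle_size):
    """Select frame indices whose camera centers are at least `min_baseline_m` apart.

    Returns the indices of the first `bundle_size` such frames, or None if the
    stream has not yet provided enough suitable frames (processing cannot start).
    """
    selected = []
    for index, center in enumerate(camera_centers):
        center = np.asarray(center, dtype=np.float64)
        if selected:
            last = np.asarray(camera_centers[selected[-1]], dtype=np.float64)
            if np.linalg.norm(center - last) < min_baseline_m:
                continue  # camera has not moved far enough since the last selected frame
        selected.append(index)
        if len(selected) == bundle_size:
            return selected
    return None


# Hypothetical camera centers (metres) of consecutive frames of a moving camera.
centers = [(0.0, 0.0, 50.0), (0.5, 0.0, 50.0), (2.0, 0.0, 50.0),
           (2.5, 0.0, 50.0), (4.5, 0.0, 50.0)]
bundle = collect_input_bundle(centers, min_baseline_m=2.0, bundle_size=3)
```

In this example the bundle is only complete once the fifth frame has arrived, so a result for the first frame becomes available several frames late, which is acceptable for online but not for real-time processing in the sense defined above.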


Structure-from-Motion, Multi-View Stereo and Simultaneous Localization and 
Mapping 


In accordance with the definition of Schönberger et al. (2016), structure-from-motion (SFM)
denotes the sparse modeling and camera pose computation from image data based on detecting
and matching discriminative image features. Multi-view stereo (MVS), on the other hand,
denotes the process of dense modeling from image data with corresponding camera poses by
means of dense image matching and dense depth estimation. Evidently, as instantiated by
common software suites for offline photogrammetric 3D reconstruction, such as COLMAP
(Schönberger and Frahm 2016, Schönberger et al. 2016), SFM is executed prior to MVS.
Simultaneous localization and mapping (SLAM) algorithms are considered as a special form of
SFM, as they are intended for online and even real-time processing, whereas generic SFM
implies nothing about run-time or the availability of input data.
Notation


This chapter introduces the notation and symbols which are used in this work. 


General Notation 


Scalars              italic Roman and Greek lowercase letters
Sets                 calligraphic and Greek uppercase letters
2D Points/Vectors    Roman lowercase letters
3D Points/Vectors    Roman uppercase letters
Matrices             bold Roman uppercase letters
Functions            italic Roman uppercase letters


Dense Image Matching 


stereo baseline, given in a metric scale

3D extrinsic translation vector, denoting the location of the camera center with respect to
a reference coordinate system

cost function to quantify the similarity between two pixels

census transform

orthogonal distance of the scene plane Π from the optical center of the camera

depth, i.e. orthogonal distance of a scene point from the optical center of the camera

2D depth map

actual focal length of the camera optics, given in a metric scale

camera constant, i.e. mathematical focal length of the pinhole camera model, in u- and
v-direction

fundamental matrix between two calibrated stereo images
Γ_u = [Δu_min, Δu_max]   disparity range in which dense image matching is performed
Γ_d = [d_min, d_max]     depth range used to sample the scene space on the highest
                         pyramid level
h_u, h_v                 actual size of a sensor element, i.e. pixel, in u- and
                         v-direction, given in a metric scale
H(Π, P_ref, P_k)         homography induced by the scene plane Π between a reference
                         camera with projection matrix P_ref and a neighboring camera
                         with projection matrix P_k
                         images of the left and right camera of a stereo camera rig
                         input image k within the multi-image setup
                         reference image within the multi-image setup
                         intrinsic camera calibration matrix
                         epipolar line corresponding to the image point m
                         projection of the scene point M in image I
                         scene point in the camera coordinate system
                         3D normal vector of the scene plane Π
                         normalized cross correlation
                         image patch around the pixel p
                         scene plane with the normal vector N_Π, located at the
                         orthogonal distance δ from the camera
                         pixel in the reference image I_L and the matching image I_R,
                         respectively
                         pixel in the reference image I_ref and the matching image I_k,
                         respectively
P = K [R | -RC]          full camera projection matrix, made up of camera calibration
                         matrix K, extrinsic rotation R and translation C
R ∈ SO(3)                extrinsic rotation matrix, with the row vectors holding the
                         normalized coordinate axes of the camera coordinate system
                         with respect to the reference coordinate system
                         three-dimensional cost volume, holding the matching score s for
                         each pixel p and the disparity Δu ∈ Γ_u or the plane Π,
                         depending on the stereo setup used
Δu                       disparity, i.e. horizontal displacement in the image
                         2D disparity map
                         2D image coordinates
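
How the projection matrix P = K [R | -RC] maps a scene point M to its image point m can be illustrated with a minimal sketch. The numeric calibration values and the helper names `projection_matrix` and `project` are hypothetical and only serve to make the notation concrete.

```python
import numpy as np


def projection_matrix(K, R, C):
    """Build the 3x4 projection matrix P = K [R | -RC] from calibration K,
    rotation R and camera center C (given in the reference coordinate system)."""
    K = np.asarray(K, dtype=np.float64)
    R = np.asarray(R, dtype=np.float64)
    C = np.asarray(C, dtype=np.float64).reshape(3, 1)
    return K @ np.hstack((R, -R @ C))


def project(P, M):
    """Project a 3D scene point M into the image and return 2D pixel coordinates."""
    M_h = np.append(np.asarray(M, dtype=np.float64), 1.0)  # homogeneous coordinates
    m_h = P @ M_h
    return m_h[:2] / m_h[2]


# Hypothetical setup: camera constant 800 px, principal point (320, 240),
# camera located 10 m behind the origin and looking along the positive z-axis.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
C = np.array([0.0, 0.0, -10.0])

P = projection_matrix(K, R, C)
m = project(P, [1.0, 0.5, 0.0])
```

The term -RC translates the scene point into the camera coordinate system before K maps it to pixel coordinates.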




Semi-Global Matching 


E           energy function of the Markov random field
V           smoothness constraint within E
L_r         1D aggregation path, as part of the dynamic programming to approximate the
            solution of E
r           direction of the aggregation path L_r
P_1, P_2    smoothness penalties, penalizing deviations in the disparity between
            neighboring pixels
S           optimized cost volume, holding the aggregated matching score s for each
            pixel p and disparity Δu ∈ Γ_u after path aggregation
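
The role of the aggregation path and the two smoothness penalties can be illustrated with a minimal sketch of the widely used semi-global matching recurrence along a single left-to-right path of one scanline. This is the textbook formulation and not necessarily the exact variant used in this work; the function name and the toy costs are hypothetical.

```python
import numpy as np


def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one 1D path (left to right on one scanline).

    costs: array of shape (width, num_disparities) with per-pixel matching costs.
    p1:    penalty for a disparity change of exactly one between neighboring pixels.
    p2:    penalty for any larger disparity change.
    """
    costs = np.asarray(costs, dtype=np.float64)
    width, num_disp = costs.shape
    aggregated = np.empty_like(costs)
    aggregated[0] = costs[0]
    for x in range(1, width):
        prev = aggregated[x - 1]
        prev_min = prev.min()
        from_minus = np.concatenate(([np.inf], prev[:-1])) + p1  # transition from d - 1
        from_plus = np.concatenate((prev[1:], [np.inf])) + p1    # transition from d + 1
        from_jump = np.full(num_disp, prev_min + p2)             # any larger jump
        best = np.minimum(np.minimum(prev, from_minus), np.minimum(from_plus, from_jump))
        # Subtracting prev_min keeps the aggregated values bounded along the path.
        aggregated[x] = costs[x] + best - prev_min
    return aggregated


# Toy example: three pixels, three disparity hypotheses.
toy_costs = [[0, 5, 5],
             [5, 0, 5],
             [5, 5, 0]]
path_costs = aggregate_path(toy_costs, p1=1.0, p2=3.0)
winning_disparities = path_costs.argmin(axis=1)
```

In a full implementation, the aggregated costs of several path directions are summed into the optimized cost volume before the disparity with the minimum cost is selected for each pixel.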


Geometric Consistency Filtering 


ε_h    hit count of reprojections for which ε_r < η_r, within the scope of a multi-view
       geometric consistency check
ε_r    reprojection error within the scope of a multi-view geometric consistency check
η_h    threshold for consistent hits within the scope of a multi-view geometric
       consistency check
η_r    threshold for the reprojection error within the scope of a multi-view geometric
       consistency check
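
The interplay of the hit count and the two thresholds can be illustrated with a minimal sketch. It assumes that the per-pixel reprojection errors with respect to each neighboring view have already been computed; the function name, the array layout and the numeric values are hypothetical.

```python
import numpy as np


def consistency_mask(reprojection_errors, max_error, min_hits):
    """Mark pixels as geometrically consistent.

    reprojection_errors: array of shape (num_views, height, width) holding the
                         reprojection error of each pixel for each neighboring view.
    max_error:           threshold on the reprojection error (pixels).
    min_hits:            minimum number of views that must fall below `max_error`.
    Returns the boolean mask and the per-pixel hit count.
    """
    errors = np.asarray(reprojection_errors, dtype=np.float64)
    hits = np.count_nonzero(errors < max_error, axis=0)
    return hits >= min_hits, hits


# Toy example: three neighboring views, one image row with two pixels.
errors = np.array([[[0.2, 3.0]],
                   [[0.4, 0.1]],
                   [[5.0, 2.0]]])
mask, hits = consistency_mask(errors, max_error=1.0, min_hits=2)
```

Pixels that fail the check would typically be invalidated in the depth map rather than corrected.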


Fast Multi-View Stereo with Surface-Aware Semi-Global 
Matching 


C          estimated confidence map corresponding to I_ref ∈ Ω, holding confidence
           values with respect to the estimates of D and N
e_k        epipole of the reference camera, i.e. projection of C_ref in the image I_k
Δi_pg      discrete index jump through a set of sampling planes to adjust the zero-cost
           transition within V_Π-pg
Δi_sn      discrete index jump through a set of sampling planes to adjust the zero-cost
           transition within V_Π-sn
I(·)       indexing function to return the index of a certain plane Π within the set of
           sampling planes
           pyramid level of the Gaussian image pyramid within the hierarchical
           processing scheme
           height of the Gaussian image pyramid
           estimated normal map corresponding to I_ref ∈ Ω
           input bundle, consisting of input images I_k and projection matrices P_k
           scene planes at minimum and maximum distance, bounding the sampling space
           perspective-invariant cross-ratio
           running gradient of the minimum cost path along the aggregation path r
           matching cost computed by C and aggregated for the left and right subset of
           matching images
           ray cast from C through the 2D image point m
           plane-wise smoothness constraint, adapting E to depth hypotheses produced by
           plane-sweep multi-image matching
           surface normal smoothness constraint, extending V_Π with an adjusted
           zero-cost transition based on the surface normal in order to account for
           non-fronto-parallel surface structures
           path gradient smoothness constraint, extending V_Π with an adjusted zero-cost
           transition based on the gradient of the minimum cost path in order to
           account for non-fronto-parallel surface structures


Self-Supervised Single-View Depth Estimation 


E_ref→k = [R | T]    relative extrinsic transformation between the reference image I_ref
                     and the neighboring image I_k
                     loss function
                     projective transformation, projecting the matching image I_k into
                     the view of the reference image I_ref, given K, D and E_ref→k
                     image reconstruction error
                     structural similarity score


1 Introduction 


Even though remotely controlled aircraft have been around for quite a long time, until
recently they were niche products, mostly used by the military or by model aircraft
enthusiasts. However, in the last decade, the availability of commercial off-the-shelf
(COTS) rotor-based unmanned aerial vehicles (UAVs) that can be purchased by anyone and are
easy to fly has increased tremendously. According to a recent report by Markets-and-Markets
(2021), the global UAV market, including all sectors, is estimated to reach 58.4 billion USD
by 2026, which corresponds to a compound annual growth rate (CAGR) of 16.4 % with respect to
the global market in 2021. With a CAGR of 28 %, the commercial segment is expected to have
the highest growth rate in the forecast period, which is attributed to the advancements of
COTS UAVs (Markets-and-Markets 2021). In recent years, these developments and advancements
have led to an increasing use of COTS UAVs in a wide range of commercial applications, such
as aerial video and photography, civil construction and industrial inspection (Chen et al.
2014, Sankarasrinivasan et al. 2015), precision agriculture (Perz and Wronowski 2019,
Tsouros et al. 2019), security monitoring (Finn and Wright 2012), disaster relief (Estrada
and Ndoma 2019, Restas 2015, Kerle et al. 2020, Furutani and Minami 2021), as well as
search-and-rescue (SAR) missions (Silvagni et al. 2017, Doherty and Rudol 2007). This work
focuses on the use of COTS UAVs by emergency forces in order to facilitate disaster relief
as well as SAR missions. In the following, a short review of the ability and the use of
modern COTS UAVs to execute specific tasks that assist emergency forces in their mission is
provided (Section 1.1), before an overview of the proposed framework and the structure of
this work is given (Section 1.2).


1.1 The Use of Commercial Off-the-Shelf UAVs for 
Disaster Relief and Search-and-Rescue Missions 


The commercial UAV market offers a great variety of UAVs, ranging from smaller low-cost
models, which are primarily intended for private use, up to high-end models for professional
use. The latter can be customized with a number of different sensors and equipment to meet
the special needs of the application and task to be executed. As illustrated by Figure 1.1,
the DJI Matrice 210v2 RTK, for example, allows the use of different types of sensors which,
in turn, are designed to execute a variety of tasks for facilitating disaster relief and SAR
missions. For one, it is equipped with localization sensors like a global navigation
satellite system (GNSS) and an inertial measurement unit (IMU). These provide precise
positioning and localization of the UAV, needed for navigation and the execution of a
planned flight path, as well as a geographical position in case missing persons are found.
Moreover, in order to ensure safety during flight, different close vicinity sensors, like a
stereo vision camera and an ultrasonic sensor, are deployed on board modern UAVs. In case of
autonomous flight, which is especially important if the UAV is to be used beyond visual
line-of-sight (BVLOS), these close vicinity sensors provide the opportunity to perform
obstacle detection and collision avoidance. For this, direct high-performance on-board
processing is crucial, which is why UAVs from the DJI Matrice series can be equipped with an
on-board processing unit (PU), e.g. the DJI Manifold 2-G. Due to the increasing demand for
deep-learning-based artificial intelligence (AI), most of these on-board PUs of modern UAVs
are equipped with graphics processing units (GPUs), just as the DJI Manifold 2-G, which is
based on the NVIDIA Jetson TX2 board. To facilitate more high-level applications, the UAVs
are equipped with appropriate payload sensors. UAVs from a lower price segment typically
only provide a single payload sensor that can, if at all, only be replaced by the
manufacturer. High-end UAVs like the DJI Matrice series provide a range of different
plug-and-play sensors that can easily be attached to the UAV by the user. Examples of such
sensors are visual cameras that provide high-resolution imagery, thermal infrared (IR)
cameras, multi- or hyperspectral cameras and, recently, even LiDAR sensors.

[Figure 1.1, diagram: localization sensors (GNSS, IMU) for path planning / navigation; close
vicinity sensors (stereo vision sensor, ultrasonic sensor, FPV camera) with on-board PU (DJI
Manifold 2-G) for obstacle detection / collision avoidance; payload sensors (visual camera,
thermal IR camera, multi- / hyperspectral camera, LiDAR sensor *) for 3D mapping, change
detection / damage assessment, map update / map generation and missing person search.]

Figure 1.1: Overview of the equipment of modern high-end COTS UAVs and how it can be used to
execute a variety of tasks and, in turn, facilitate disaster relief and search-and-rescue
(SAR) missions. The depicted DJI Matrice 210v2 RTK is equipped with localization sensors
like a global navigation satellite system (GNSS) and an inertial measurement unit (IMU) for
precise navigation and geographical localization. Close vicinity sensors ensure a safe
autonomous flight by making it possible to perform obstacle detection and collision
avoidance on a high-performance on-board processing unit (PU), such as the DJI Manifold 2-G.
More high-level applications, such as 3D mapping, damage assessment and missing person
search, can be achieved by a variety of payload sensors. This work focuses on the
applications of obstacle detection and collision avoidance, 3D mapping and change detection
based on visual images provided by the stereo vision sensor as well as a visual payload
camera (printed in bold). * The light detection and ranging (LiDAR) sensor is not available
for the DJI Matrice 210.


A large number of tasks and applications can be executed with such sensors to facilitate
disaster relief and SAR missions. For example, 3D mapping can be achieved from the camera
images by means of photogrammetry or directly from the data provided by the LiDAR sensor.
This is crucial for an effective assessment of the disaster site and for appropriate
planning and deployment of the emergency forces (Furutani and Minami 2021). Compared to
relying on data provided by manned aircraft or satellites for the task of 3D mapping, the
use of UAVs has a great advantage in terms of cost efficiency and rapid deployment (Eastern
Kentucky University 2017, Furutani and Minami 2021). Ferris-Rotman (2015) and DroneDeploy
(2018) report how COTS UAVs were used, together with photogrammetric software for off-board
3D mapping, to help assess the aftermath of the 2015 Nepal and 2018 Indonesia earthquakes.
The images from the visual camera, paired with 3D data, are used to perform change detection
and damage assessment as well as to generate a current map of the disaster site. This is
helpful for the assessment of the situation by the emergency forces. In other situations,
the thermal IR camera makes it possible to search for missing persons efficiently. Silvagni
et al. (2017), for example, propose to use a UAV equipped with a thermal IR camera to find
missing persons in mountainous areas who are buried in snow after an avalanche. If a multi-
or hyperspectral camera is deployed, surface structures and soil moisture can be monitored
in order to reason about the structural integrity of collapsed buildings (Ferris-Rotman
2015) or a possible dam failure after flooding or intense rainfall. Apart from these
applications, there are many more use-cases for both fixed-wing and rotor-based UAVs to
assist disaster relief and SAR missions. They can, for example, also be used to deliver
goods, such as medical supplies, to areas that are difficult to access, or even to set up
radio relays to facilitate communication (Shakhatreh et al. 2019).


As illustrated by the bold print in Figure 1.1, the aim of this work is to facilitate three of the 
above-mentioned applications, namely obstacle detection and collision avoidance, 3D map- 
ping and change detection, by means of image-based depth estimation from data provided 
by the stereo vision sensor and the visual payload camera. In particular, the empowerment 
and support of first responders is of great interest, with focus on fire fighters and medical 
emergency forces. This increases the need for fast and reliable processing in order to ensure 
a rapid assessment. The proposed framework relies solely on data from visual cameras, as
they are typically deployed on modern UAVs by default and are cheaper than other sensors,
such as LiDAR sensors. In the following, the proposed framework for the support of first
responders in the rapid structural assessment of the disaster site by means of COTS UAVs
is presented, together with the associated challenges and the contributions of this work.


1.2 Proposed Framework, Challenges and Contributions 


In this work, a framework for the support of first responders, e.g. fire fighters and medical
emergency forces, in the rapid structural assessment of the disaster site by means of a COTS
UAV is proposed. The idea of the covered use-case is that in case of a disaster, which can be
of natural cause, e.g. earthquake, or induced by humans, e.g. road accident, specialized first 
responders can use a COTS UAV in combination with a ground control station (GCS) to rapidly 
assess the situation by aerial reconnaissance. In particular, fast 3D and orthographic mapping 
is of great benefit to rapidly acquire an overview of the disaster site and in turn perform route 
and mission planning for the emergency forces. Moreover, given the computed 3D data of 
the area, change detection can be performed to assess the caused damage. Apart from the 
high-level applications, a safe operation of the deployed UAV is to be ensured, especially if
it is operated autonomously, which further increases its benefit as the operating personnel 
can concentrate on the data produced by the system rather than on flying the UAV itself. 
While the actual flight can be realized by providing geographical waypoints to the flight 
controller, a dynamic obstacle detection and collision avoidance needs to be implemented for 
a safe autonomous flight. As discussed above, there exist numerous other applications, such 
as path planning or person detection, that can be executed in order to assist first responders. 
They are, however, not covered in the scope of this work. 


With this use-case and these applications in mind, a framework that consists of two parts is
proposed, as illustrated in Figure 1.2. (i) The blue part uses the image data from the stereo
vision sensor and performs two-view stereo depth estimation in real time, which serves as input
to an obstacle detection and collision avoidance algorithm. This is done directly on board the
UAV, utilizing the high-performance on-board PU. (ii) In the red part, the image data from the
payload camera is streamed down to a GCS, where it is processed by more powerful hardware.
Here, approaches for fast multi-view stereo (MVS) depth estimation and single-view depth
estimation (SDE) are proposed to facilitate rapid 3D mapping and change detection. Details
on the realization of the data link are not covered in the scope of this work. In the following
two sections, the challenges associated with each of the outlined components as well as the
contributions of this work to tackle these challenges are discussed.


1.2.1 Two-View Stereo Depth Estimation for Obstacle Detection and 
Collision Avoidance 


As outlined above, the first part of the presented framework performs obstacle detection 
and collision avoidance based on dense depth maps which are estimated from a two-view 
camera setup. In this, the complete processing is done in real time and on embedded
hardware, which poses a number of challenges and restrictions. The main contributions of
this work to tackle these challenges are presented in Chapter 2, Chapter 3 and Section 6.1 
and are summarized in the following: 


Reliable and highly accurate dense depth estimation: For safety-critical applications,
such as obstacle detection and collision avoidance, reliable and deterministic
processing is of great importance. Moreover, in order to confidently detect obstacles and
derive their location with respect to the UAV by means of stereo image processing, the
depth maps must have a high completeness and be of metric scale. Here, the visual
appearance of the depth map and the subpixel accuracy of the estimates play a minor
role. Thus, this work proposes ReS²tAC (Ruf et al. 2021a), an approach which adopts a
widely used stereo algorithm for high-performance depth estimation based on stereo
images. It reliably computes dense depth maps that achieve state-of-the-art accuracies
on public stereo benchmarks, with an error rate as low as 4 % and a completeness of
88 % and higher.


Computationally efficient obstacle detection and collision avoidance: The proposed 
framework relies on the stereo vision sensor to perceive the space in front of the 
UAV and, in turn, perform obstacle detection and collision avoidance. In this, dense 
depth maps are estimated by means of stereo image processing, which has a high 
computational complexity since depth hypotheses are to be estimated for a great 
majority of pixels in order to reliably detect obstacles. This, in turn, requires the sub- 
sequent process of obstacle detection and collision avoidance to be computationally 
efficient in order to not exceed given run-time and latency constraints. Thus, this work 
demonstrates the suitability of a simple and yet effective approach (Ruf et al. 2018a) 
for obstacle detection and collision avoidance. This approach reprojects the depth maps into a
top and side view of the environment in front of the UAV, which allows possible
obstacles to be detected by means of simple clustering and thresholding operations.


Real-time & low-latency processing: Despite the high computational complexity of
estimating dense depth maps, the processing has to be done in real time and with low
latency in order for the flight controller to still be able to initiate an appropriate
avoidance maneuver. In this, the processing entails the estimation of the depth map, as well
as the detection of obstacles and the calculation of the evasion maneuver. This maneuver
can be as complex as recalculating the flight path in order to evade the obstacle, or
as simple as initiating an emergency stop before calculating a new route. This work
shows that both the approach for dense depth estimation (Ruf et al. 2021a) and
the approach for obstacle detection and collision avoidance (Ruf et al. 2018a) allow
real-time and low-latency processing by utilizing general-purpose computation on a
GPU (GPGPU) and vectorized single-instruction-multiple-data (SIMD) processing on a
CPU.


On-board embedded processing: Lastly, the most challenging aspect is that all
processing is to be done on an embedded device on board the UAV in order to ensure low
latency. Communication with a GCS to allow off-board processing would
induce unnecessary latency and be subject to additional sources of unreliability due
to a possible data-link failure. The use of on-board embedded devices, however,
comes with reduced processing capabilities. This is because a trade-off
between low-power and high-performance processing is to be found in order to
not excessively reduce the flight time while at the same time enabling fast processing.
With ReS²tAC (Ruf et al. 2021a), a real-time execution on embedded CUDA and ARM
devices, such as the NVIDIA Jetson boards, is demonstrated. These embedded systems
are increasingly deployed for on-board processing, as the development of appropriate
algorithms is less cumbersome and time-consuming compared to the development for
field-programmable gate array (FPGA) technology.


The contributions presented in these chapters have been published in: 


• Ruf, B.; Mohrs, J.; Weinmann, M.; Hinz, S., and Beyerer, J. (2021a): “ReS²tAC - UAV-borne
real-time SGM stereo optimized for embedded ARM and CUDA devices”. In:
MDPI Sensors 21.11, 3938. Peer-reviewed on the basis of the full paper. Cited as (Ruf
et al. 2021a).


• Ruf, B.; Monka, S.; Kollmann, M., and Grinberg, M. (2018a): “Real-time on-board obstacle
avoidance for UAVs based on embedded stereo vision”. In: International Archives of
the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-1, pp. 363-370.
Cited as (Ruf et al. 2018a).


1.2.2 Multi-View Stereo and Single-View Depth Estimation for Fast
3D Mapping and 3D Change Detection


The second part of the proposed framework aims at using multi-view stereo and single-view 
depth estimation for fast and online 3D mapping and 3D change detection. Even though 
it is assumed that the processing is to be done on a GCS, equipped with enough hardware 
resources, the requirements and the nature of the input data impose a number of other
challenges. These are tackled by the contributions presented in Chapter 4 and Chapter 5, as well as in
Section 6.2 and Section 6.3, and are summarized in the following:


Accurate dense 3D mapping and rapid 3D change detection: In order to facilitate a 
good assessment of a disaster site, as outlined by the considered use-case, an accurate 
and dense 3D mapping is essential. In this, for the actual representation of the area
and, in turn, the assessment of the situation, it is not critical if the 3D map lacks a
high level of detail. Yet, it is of great importance that it is densely and thoroughly
reconstructed without big holes, i.e. areas with missing data. Just as in the case of software
suites for offline reconstruction, dense 3D mapping requires the computation of dense
depth maps by means of dense image matching (DIM) with a subsequent fusion into a
3D point cloud or 3D model. With FaSS-MVS (Ruf et al. 2021b), this work proposes
an approach that efficiently estimates accurate dense depth maps, which have an
absolute error of only approximately 1 % with respect to the maximum depth of the
reconstructed scene. Furthermore, this work presents how appropriate algorithms for 
dense depth estimation can be utilized in a processing pipeline for fast and online 3D 
mapping from aerial imagery captured by COTS UAVs (Hermann et al. 2021b) as well 
as for rapid 3D change detection (Ruf and Schuchert 2016). 


Efficient handling of large scene depth and oblique imagery: A large scene depth 
poses a great challenge to the estimation of dense depth maps, especially if the pro- 
cessing is subjected to run-time constraints. While images captured from a nadir 
viewpoint only have a confined depth range, which typically depends on the flight 
altitude, a straight flyover is not always possible due to numerous reasons, for example, 
because of fire and smoke or other air traffic. In such cases, the scene can only be 
captured from oblique viewpoints which, in turn, increases the scene depth and with 
it the computational complexity, as more depth hypotheses have to be generated and 
evaluated. Moreover, due to the oblique vantage points, the scenes depicted by such 
imagery are mostly comprised of non-fronto-parallel surface structures. This requires 
the consideration of slanted surfaces in the estimation of dense depth maps in order 
to not induce stair-casing artifacts in the resulting reconstruction. The approach
for multi-view stereo (Ruf et al. 2019, Ruf et al. 2021b) presented in this work, i.e.
FaSS-MVS, employs a hierarchical processing scheme in the estimation of dense depth
maps and thus allows the large scene depths of oblique imagery to be handled efficiently by
making use of a coarse-to-fine strategy. Furthermore, this work proposes a surface-aware
optimization scheme for the presented multi-view stereo approach in order to
account for non-fronto-parallel surfaces and, in turn, reduce stair-casing artifacts in
the reconstruction of slanted surfaces.


Handling the existence of dynamic scene objects: When performing monocular 3D 
modeling, a key assumption is that the scene remains static while the camera moves 
around it during data acquisition. During the operation of first responders, this can 
seldom be ensured, as the emergency forces themselves and/or other persons, e.g.
casualties, might be moving around on the disaster site. On the other hand, depending 
on the application, it might also be of benefit and of interest to model the disaster site 
in 4D. This means that moving objects should be correctly placed inside the model. 
A possible application would be to monitor the operation of the emergency forces 
in order to ensure their safety. To account for the inability of monocular multi-view 
stereo approaches to reconstruct dynamic objects, this work presents an approach 
for single-view depth estimation (Hermann et al. 2020), i.e. depth estimation from a 
single aerial image. This approach learns to predict depth maps from a single input 
image by means of a deep convolutional neural network (CNN) and is thus able to 
also predict depth maps for scenes with dynamic objects. Furthermore, it is trained 
in a self-supervised manner, not requiring any special training data apart from a 
large number of camera images, making it suitable for a great variety of domains. 
This allows the presented approach for single-view depth estimation to be executed in
parallel to a monocular multi-view stereo approach, enabling the fusion of the depth
maps from both approaches and, thus, the filling of missing estimates in multi-view stereo
depth maps with the results of the single-view depth estimation
approach.


Fast and online processing: Again, in all this, the time it takes to process the data and to 
generate the desired output is not irrelevant. As the outlined use-case aims at support- 
ing first responders, the processing is to be done fast and online. This means that the 
3D mapping is to be done while the UAV is acquiring image data and streaming it to 
the GCS. In contrast to the embedded processing for obstacle detection and collision 
avoidance, however, there is little restriction with respect to the available processing 
resources. Nonetheless, fast and online dense depth estimation by means of multi-view 
stereo and a subsequent 3D mapping is still challenging, not only due to the reasons 
mentioned above, but also because not all data is available at the time of processing. In 
particular with respect to handling oblique imagery with a large scene depth, this might
inhibit the exploitation of an appropriate baseline between the images, which is
sufficient for an accurate reconstruction. Again, by employing optimizations for massively
parallelized GPGPU, the presented approach for multi-view stereo, i.e. FaSS-MVS (Ruf
et al. 2021b), achieves a processing rate of 1-2 Hz, which is well-suited for online
processing of a monocular image sequence. This work further shows that, depending on
the image size, the presented approach for single-view depth estimation (Hermann et
al. 2020) even reaches theoretical frame rates of up to 250 FPS, making it capable of
real-time and low-latency processing.


The contributions presented in these chapters have been published in: 


• Ruf, B.; Weinmann, M., and Hinz, S. (2021b): “FaSS-MVS - Fast multi-view stereo with
surface-aware semi-global matching from UAV-borne monocular imagery”. In: arXiv
preprint arXiv:2112.00821v1. Published as preprint. Cited as (Ruf et al. 2021b).


• Hermann, M.; Ruf, B., and Weinmann, M. (2021b): “Real-time dense 3D reconstruction
from monocular video data captured by low-cost UAVs”. In: International Archives of
the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B2-2021,
pp. 361-368. Cited as (Hermann et al. 2021b).


• Hermann, M.; Ruf, B.; Weinmann, M., and Hinz, S. (2020): “Self-supervised learning
for monocular depth estimation from aerial imagery”. In: ISPRS Annals of the
Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020, pp. 357-364.
Peer-reviewed on the basis of the full paper. Cited as (Hermann et al. 2020).


• Ruf, B.; Pollok, T., and Weinmann, M. (2019): “Efficient surface-aware semi-global
matching with multi-view plane-sweep sampling”. In: ISPRS Annals of the
Photogrammetry, Remote Sensing and Spatial Information Sciences IV-2/W7, pp. 137-144.
Peer-reviewed on the basis of the full paper. Cited as (Ruf et al. 2019).


• Ruf, B.; Thiel, L., and Weinmann, M. (2018b): “Deep cross-domain building extraction
for selective depth estimation from oblique aerial imagery”. In: ISPRS Annals of the
Photogrammetry, Remote Sensing and Spatial Information Sciences IV-1, pp. 125-132.
Peer-reviewed on the basis of the full paper. Cited as (Ruf et al. 2018b).


• Ruf, B.; Erdnüß, B., and Weinmann, M. (2017): “Determining plane-sweep sampling
points in image space using the cross-ratio for image-based depth estimation”. In:
International Archives of the Photogrammetry, Remote Sensing and Spatial Information
Sciences XLII-2/W6, pp. 325-332. Cited as (Ruf et al. 2017).


• Ruf, B. and Schuchert, T. (2016): “Towards real-time change detection in videos based
on existing 3D models”. In: Proceedings of the SPIE Image and Signal Processing for
Remote Sensing XXI, pp. 489-502. Cited as (Ruf and Schuchert 2016).


2 Dense Depth Estimation by Two-View and
Multi-View Stereo


This chapter includes material from the following publications: 


Ruf, B.; Mohrs, J.; Weinmann, M.; Hinz, S., and Beyerer, J. (2021a): “ReS²tAC - UAV-borne real-time SGM
stereo optimized for embedded ARM and CUDA devices”. In: MDPI Sensors 21.11, 3938. Reprinted with
permission. It is cited as (Ruf et al. 2021a) and marked with a blue sidebar.


Ruf, B.; Weinmann, M., and Hinz, S. (2021b): “FaSS-MVS - Fast multi-view stereo with surface-aware semi- 
global matching from UAV-borne monocular imagery”. In: arXiv preprint arXiv:2112.00821v1. Reprinted with 
permission. It is cited as (Ruf et al. 2021b) and marked with a green sidebar. 


The ultimate goal of computer vision is to enable computers and robotic systems to fully
perceive and understand their surroundings by the interpretation of camera images, just like
the human brain does with the visual information it receives from the eyes. Moreover, to be
able to freely navigate through and interact with the three-dimensional world, it is of vital
importance to be able to perceive and map the surroundings not only in 2D but also in 3D. In
the process of image acquisition, however, the three-dimensional world gets projected onto a
two-dimensional plane, discarding one dimension and losing vital information on the depth
of the scene, which is important to be able to reconstruct the world or objects therein in
3D. Thus, an essential task for robotic vision is to estimate the depth of the scene and, in
turn, to reason about the 3D geometry. If, due to different constraints, no active sensing (e.g.
LiDAR systems) is available, this has to be done by solely relying on 2D camera vision. With
enough training, we humans, and also modern machine learning algorithms, are able to learn
how to predict a sense of scene depth from single images. This prediction, however, is only
based on the experience and knowledge gained in life or during training, respectively. It can 
fail, if there is not enough contextual information available in the image, or if the scene that 
is depicted is unknown or difficult to grasp. Moreover, in such a case, the predictions on 
sizes and distances that are made are only relative, i.e. how big is one object with respect to 
another, or how far away is a certain scene point relative to a reference. 


However, the scene depth can be accurately reconstructed if not one, but at least two cameras
are available and both are positioned in such a way that they observe the same scene at
the same time from two slightly different vantage points. Just like with the two eyes of
humans, the slight difference in the arrangement of the scene observed in the images of the
two cameras can be used to reason about the distance of the depicted object with respect to
the camera setup. This is referred to as stereoscopic vision, or in short stereo. As illustrated
in Figure 2.1, two cameras placed apart from each other with a distance $b$ (baseline) observe
a scene point $M$ from two different vantage points. The scene point will appear at different
locations $m^L$ and $m^R$ in the two camera images $I_L$ and $I_R$. Finding such correspondences is
the essential task in the process of stereo processing, and in turn dense depth estimation.
(Ruf et al. 2021a, Appendix A.)


Figure 2.1: Illustration of a stereo camera setup. Two cameras placed apart from each other with a distance $b$ (baseline) observe the same scene from two different
vantage points. A scene point $M$ is projected onto different locations $m^L$ and $m^R$ in the two camera images $I_L$ and $I_R$. If the image pair is rectified, the
difference between $m^L$ and $m^R$ is reduced to a horizontal shift, the so-called disparity $\Delta u = |m^L_u - m^R_u|$. (Ruf et al. 2021a, Fig. A1.)


Generally speaking, a correspondence search between an image point in the one camera im- 
age and the image projection of the same scene point in the other camera image has to be 
conducted along the epipolar curve, which is the viewing ray going through the image point 
of the first camera projected into the image of the second camera. This correlation is formu- 
lated by the epipolar constraints (cf. (Hartley and Zisserman 2004, Sec. 9.1.)). If the camera 
rig is not calibrated, the two image planes are not rectified, i.e. the image planes are not co- 
planar. This is illustrated by the light gray boxes in Figure 2.1. When the intrinsic parameters 
of the cameras are known, it is possible to account for image distortions, transforming the 
curve into an epipolar line. Furthermore, if the relative position and orientation between the 
two cameras of the stereo rig are known, the camera images can be rectified, i.e. they can be 
transformed onto a common image plane, and in turn the epipolar lines can be transformed to 
coincide with the image row of the image points. Thus, the difference between the image
positions of $m^L$ and $m^R$ is reduced to a horizontal shift, the so-called disparity $\Delta u = |m^L_u - m^R_u|$.
(Ruf et al. 2021a, Appendix A.)


From this, the depth $d$ of the scene point, i.e. the orthogonal distance from the optical center
of the reference camera, can be deduced according to:


$$d = \frac{f \cdot b}{\Delta u \cdot h_u}. \tag{2.1}$$


In this equation, $f$ denotes the actual focal length of the optics, while $h_u$ states the width of
a pixel on the sensor chip of the camera, both given in millimeters (mm). Together with the
baseline $b$, which is typically also given in metric units, these additional parameters allow
a disparity map, given in pixels, to be converted into a depth map with a metric scale. In case the
intrinsic parameters of a camera are not known a priori and thus are calibrated with images of
a calibration pattern, the focal length and the size of the sensor element are often combined
into a single, two-dimensional parameter, i.e. $f_u$ and $f_v$. These are denoted as the camera
constant in u- and v-direction. They are located on the diagonal of the intrinsic camera matrix
and are usually referred to as the mathematical focal length of the pinhole camera system.
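

The conversion described by Equation (2.1) can be illustrated with a short sketch. The function name, the parameter names and the choice to mark pixels without a valid disparity as NaN are assumptions made for this illustration and are not taken from the cited work.

```python
import numpy as np


def disparity_to_depth(disparity_px, focal_length_mm, baseline_m, pixel_width_mm):
    """Convert a disparity map (in pixels) into a metric depth map via Eq. (2.1).

    focal_length_mm / pixel_width_mm is the focal length in pixels, so the
    result carries the unit of the baseline (here: meters).
    """
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth_m = np.full(disparity_px.shape, np.nan)
    # Assumption: non-positive disparities are treated as invalid estimates.
    valid = disparity_px > 0
    depth_m[valid] = (focal_length_mm * baseline_m) / (
        disparity_px[valid] * pixel_width_mm
    )
    return depth_m
```

For example, with a hypothetical focal length of 4 mm, a pixel width of 0.004 mm and a baseline of 0.1 m, a disparity of 10 pixels corresponds to a depth of 10 m.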


In their work, Scharstein and Szeliski (2002) have studied and categorized a number of stereo
algorithms for dense disparity or depth estimation based on their processing steps. From
their observations, they have derived a general processing pipeline, which provides a basic
blueprint for most modern stereo algorithms, as illustrated in Figure 2.2. In this, the
estimation of dense disparity maps can be subdivided into three subsequent steps. Given two input
images $I_L$ and $I_R$, a dense image matching and cost computation is first executed, yielding a
three-dimensional cost volume $S$. Then, a cost optimization and regularization are performed,
transforming the raw cost volume $S$ into a regularized cost volume $\bar{S}$. In the last stage, a
disparity map $U$ or depth map $D$ is extracted from $\bar{S}$, and refinement and post-processing
operations are executed. (Ruf et al. 2021a, Appendix A.)


Figure 2.2: General processing pipeline for the task of dense disparity or depth estimation based on two-view stereo.
Given two input images $I_L$ and $I_R$, a dense image matching and cost computation is first executed,
yielding a three-dimensional cost volume $S$. Then, a cost optimization and regularization are performed,
transforming the raw cost volume $S$ into a regularized cost volume $\bar{S}$. In the last stage, a disparity map $U$
or depth map $D$ is extracted from $\bar{S}$, and refinement and post-processing operations are executed.


The following sections give a brief overview of each individual step and its specific task
within the processing chain. While the processing pipeline is proposed for the disparity and
depth estimation from two images, it can also be adapted for the use of three or more input
images. The most significant change in this adaptation concerns the dense
image matching and cost computation, which is discussed in the next section.


2.1 Dense Image Matching and Cost Computation 


Finding two corresponding image points that depict the same scene point in both images
of a stereo camera setup means matching the pixels of the reference image, typically the
image of the left camera, against each pixel in the matching image within a certain disparity
range $\Delta u \in D_u = [\Delta u_{\min}, \Delta u_{\max}]$. In this, the similarity between two pixels is modeled by a
similarity measure from which a cost function $C(\cdot)$ can be deduced, which typically is at its
minimum if both pixels coincide. This so-called matching cost between a pixel $p^L = (u, v)$ in
the left image and a corresponding pixel $p^R$, located according to a disparity shift $\Delta u$ in the
right image, is stored in the three-dimensional cost volume $S$:


$$S(p^L, \Delta u) = C\big(I_L(u, v),\, I_R(u - \Delta u, v)\big). \tag{2.2}$$


Thus, the objective in finding two corresponding image points is to minimize the matching
cost computed by $C(\cdot)$. (Ruf et al. 2021a, Sec. 2.1.1.)


When relying on distinctive image features like SIFT (Lowe 2004), SURF (Bay et al. 2006)
or ORB (Rublee et al. 2011) features, a unique matching between two image points can be
found. However, such image features can only be calculated in descriptive image regions,
resulting in a sparse correspondence field. Thus, in the case of dense image matching (DIM),
Equation (2.2) is evaluated for each pixel in the reference image, computing a pixel-wise
matching cost $s$ according to $C(\cdot)$, which indicates the similarity between the pixel pair. This
cost is then stored inside a three-dimensional cost volume $S$, from which the disparity map
is later extracted. (Ruf et al. 2021a, Sec. 2.1.1.)
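

A minimal sketch of how Equation (2.2) fills a cost volume for rectified grayscale images may help to make this concrete. It deliberately swaps in the absolute intensity difference as the cost function, which is a simple stand-in and not one of the similarity measures discussed in this work, and it extracts the disparity with a plain winner-takes-all selection without any regularization. The function names, the array layout `S[v, u, Δu]` and the default cost of 255 for pixels without a matching partner are assumptions.

```python
import numpy as np


def build_cost_volume(img_left, img_right, max_disparity, invalid_cost=255.0):
    """Fill S[v, u, d] = |I_L(u, v) - I_R(u - d, v)| for d in [0, max_disparity]."""
    img_left = np.asarray(img_left, dtype=np.float64)
    img_right = np.asarray(img_right, dtype=np.float64)
    height, width = img_left.shape
    # Pixels for which u - d lies outside the right image keep the default cost.
    cost_volume = np.full((height, width, max_disparity + 1), invalid_cost)
    for d in range(max_disparity + 1):
        # The left pixel (u, v) is compared against the right pixel (u - d, v).
        cost_volume[:, d:, d] = np.abs(img_left[:, d:] - img_right[:, : width - d])
    return cost_volume


def winner_takes_all(cost_volume):
    """Select, per pixel, the disparity with the minimum matching cost."""
    return np.argmin(cost_volume, axis=2)
```

In a complete pipeline, the raw cost volume would be regularized before the disparity map is extracted. The sketch skips this step on purpose.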


Since the disparity is only evaluated along the same pixel row, it is assumed that the input
images $I_L$ and $I_R$ are rectified prior to the process of image matching. The OpenCV library
provides a standard calibration routine (Zhang 2000), which can be used to calculate two
rectification maps that allow the input images to be efficiently resampled such that the epipolar
lines lie horizontally on the image rows. (Ruf et al. 2021a, Sec. 2.1.1.)


Two prominent and widely used similarity measures for DIM are the census transform and 
the normalized cross correlation. Both use a support region of fixed size around a reference 
pixel, which is typically situated in the center of the support region. Thus, the support region 
typically has an odd side length. The computation of the census transform and normalized
cross correlation as well as the corresponding cost functions are described in Section 2.1.1 
and Section 2.1.2, respectively. 


2.1.1 Census Transform and its Hamming Distance 


First proposed by Zabih and Woodfill (1994), the census transform (CT) is often used as a
similarity measure for efficient and real-time DIM, due to its low computational complexity.
Given a support region of fixed size of $m \times n$ pixels, the support region around each pixel of a
grayscale image is converted into a bit-string of length $m \cdot n$. This is done by comparing the
intensity of each neighbor pixel $q$ to the intensity of the reference pixel $p$ at the center of the
support region and setting the corresponding bit to 1 if the intensity of $q$ is less than that of $p$:


$$\mathrm{CT}(p, q) = \begin{cases} 0, & \text{if } I(q) \geq I(p) \\ 1, & \text{if } I(q) < I(p). \end{cases} \tag{2.3}$$


The similarity between two image patches is then calculated by computing the Hamming 
distance between the corresponding bit-strings, i.e. by counting how many bits differ between 
the two strings. In this, it is assumed that the lower the Hamming distance is, the more similar 
the two corresponding image patches are. While the CT is very robust to local illumination
changes, it is not very descriptive, since it reduces the actual image content to a bit-string,
which only encodes the relative intensities within a confined support region. 
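

The census transform and its Hamming distance can be sketched in a few lines. The sketch below stores the bit-string of each pixel as a boolean vector of length m · n, including the always-zero bit of the center pixel, instead of packing it into an integer as an optimized implementation would. The function names, the interpretation of m as rows and n as columns, and the replication of border pixels are assumptions made for this illustration.

```python
import numpy as np


def census_transform(image, m=3, n=3):
    """Return the census bit-strings as a boolean array of shape (H, W, m * n).

    A bit is set if the neighbor intensity is less than the intensity of the
    reference pixel at the center of the m x n support region (m, n odd).
    """
    image = np.asarray(image, dtype=np.float64)
    height, width = image.shape
    pad_v, pad_u = m // 2, n // 2
    # Assumption: image borders are handled by replicating the edge pixels.
    padded = np.pad(image, ((pad_v, pad_v), (pad_u, pad_u)), mode="edge")
    bits = np.zeros((height, width, m * n), dtype=bool)
    index = 0
    for dv in range(m):
        for du in range(n):
            neighbor = padded[dv:dv + height, du:du + width]
            bits[:, :, index] = neighbor < image
            index += 1
    return bits


def hamming_cost(bits_a, bits_b):
    """Count the differing bits between two census bit-strings per pixel."""
    return np.count_nonzero(bits_a != bits_b, axis=-1)
```

Because only the ordering of intensities enters the bit-string, an affine brightness change with positive gain leaves the census transform, and hence the Hamming cost, unchanged. This illustrates the robustness to illumination changes mentioned above.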


2.1.2 Truncated and Scaled Normalized Cross Correlation 


Compared to the CT and its Hamming distance, the normalized cross correlation (NCC) is
much more descriptive and unique in the matching of two patches with a size of $m \times n$ pixels.
However, the computation of the NCC is significantly more complex than the computation
of the CT. The NCC between the image patches around two pixels $p^L$ and $p^R$ is computed
according to:


$$\mathrm{NCC}(p^L, p^R) = \frac{\sum_{q^L \in P^L,\, q^R \in P^R} \left(I_L(q^L) - \bar{P}^L\right) \cdot \left(I_R(q^R) - \bar{P}^R\right)}{\sqrt{\sum_{q^L \in P^L} \left(I_L(q^L) - \bar{P}^L\right)^2 \cdot \sum_{q^R \in P^R} \left(I_R(q^R) - \bar{P}^R\right)^2}}, \tag{2.4}$$


with $P^L$ and $P^R$ being the image patches around $p^L$ and $p^R$, respectively, and with $\bar{P}^L$ and $\bar{P}^R$
being the mean intensity values of the corresponding image patches. The NCC can
geometrically be interpreted as the dot product between the normalized vectors of the intensity values
corresponding to the pixels’ zero-mean image patches. Thus, it is at its maximum, i.e. 1, if both
patches are equal. Since the range of the NCC is $[-1, 1]$, it is typically truncated and scaled
when used as a cost function in the process of DIM (Sinha et al. 2014, Scharstein et al. 2017):


$$C_{\mathrm{NCC}}(p^L, p^R) = \left(1 - \max\left(0, \mathrm{NCC}(p^L, p^R)\right)\right) \cdot 255. \tag{2.5}$$
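

Equations (2.4) and (2.5) can be sketched for a single pair of patches as follows. The function names are hypothetical, and treating a textureless patch, for which the NCC is undefined, as uncorrelated is an assumption of this sketch rather than a rule stated in the text.

```python
import numpy as np


def ncc(patch_left, patch_right, eps=1e-12):
    """Normalized cross correlation between two equally sized patches, Eq. (2.4)."""
    a = np.asarray(patch_left, dtype=np.float64).ravel()
    b = np.asarray(patch_right, dtype=np.float64).ravel()
    # Subtract the mean intensities to obtain zero-mean patches.
    a = a - a.mean()
    b = b - b.mean()
    denominator = np.sqrt(np.sum(a * a) * np.sum(b * b))
    if denominator < eps:
        # Assumption: a constant patch carries no texture and is treated
        # as uncorrelated instead of dividing by zero.
        return 0.0
    return float(np.dot(a, b) / denominator)


def ncc_cost(patch_left, patch_right):
    """Truncated and scaled NCC cost in the range [0, 255], Eq. (2.5)."""
    return (1.0 - max(0.0, ncc(patch_left, patch_right))) * 255.0
```

Identical patches, as well as patches that differ only by a positive gain and an offset, yield a cost close to 0, while uncorrelated or anti-correlated patches are truncated to the maximum cost of 255.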


2.1.3 True Multi-Image Matching by Means of Plane-Sweep 
Sampling 


As discussed above, the epipolar constraint formulates the relationship between an image
point $m^L$ in the one camera image and the epipolar line $l^R$, on which the corresponding image
point $m^R$ in the other camera image lies. Here, both image points depict the same scene point.
This relationship is expressed by the so-called fundamental matrix $F$:


$$l^R = F \cdot m^L. \tag{2.6}$$
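

As a small numerical illustration of Equation (2.6), the following sketch composes a fundamental matrix from known calibration data and evaluates the epipolar line of a pixel. It assumes the common convention that a point $X_L$ in the frame of the first camera is transferred to the second camera by $X_R = R \cdot X_L + T$, which gives $F = K_R^{-\mathsf{T}} [T]_\times R K_L^{-1}$ (Hartley and Zisserman 2004). The function names are hypothetical.

```python
import numpy as np


def skew(t):
    """Return the 3x3 skew-symmetric matrix [t]_x with [t]_x @ a == cross(t, a)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_matrix(K_left, K_right, R, T):
    """Compose F from intrinsics and relative pose, assuming X_R = R @ X_L + T."""
    essential = skew(np.asarray(T, dtype=np.float64)) @ R
    return np.linalg.inv(K_right).T @ essential @ np.linalg.inv(K_left)


def epipolar_line(F, m_left):
    """Return the epipolar line l^R = F @ m^L for a pixel (u, v) of the first image."""
    return F @ np.array([m_left[0], m_left[1], 1.0])
```

For a rectified setup, i.e. identical intrinsics, no relative rotation and a translation purely along the horizontal axis, the resulting line has the form $(0, l_2, l_3)$ and thus coincides with the image row of the reference pixel.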


A similar characteristic exists for the three- and four-view geometry. Analogously to the
fundamental matrix, the trifocal and quadrifocal tensors allow point-line correspondences
to be established between three and four views, respectively. Due to the increasing complexity in
the computation of these tensors with an increasing number of views, their practical use does
not go beyond four views (Hartley and Zisserman 2004). (Ruf et al. 2021b, Sec. 2.2.)


Algorithm 2.1: The basic plane-sweep algorithm for true multi-image matching.


Input Data: multiple images $I_k$ with a pre-selected reference image $I_{ref}$, corresponding
camera calibration matrices $K_k$ and camera poses $P_k$, as well as a set of
planes $\Pi_N$ with a normal vector $N_\Pi$ and distances $\delta \in [\delta_{\min}, \delta_{\max}]$.

Result: cost volume $S$, holding the matching cost for all pixels $p \in I_{ref}$, matched
according to $\Pi \in \Pi_N$.


foreach plane $\Pi \in \Pi_N$ do
    - warp all matching images $I_k \setminus I_{ref}$ according to $H(\Pi, P_{ref}, P_k)^{-1}$ into the view of the
      reference camera, yielding $\tilde{I}_k$.
    foreach pixel $p = (u, v) \in I_{ref}$ do
        - compute matching costs between reference image and warped matching
          images:

          $$S(p, \Pi) = \sum_k C\big(I_{ref}(u, v),\, \tilde{I}_k(u, v)\big).$$

    end
end


Besides the fundamental matrix, there exists a second projective relationship between two
views, which can be extended equally to an arbitrary number of views. Again, the case of
a two-camera setup with known intrinsic and extrinsic parameters is considered. This time,
however, an additional scene plane Π = (N_Π, δ) is positioned in the field-of-view (FOV)
of both cameras. Here, the scene plane Π is parameterized by its normal vector N_Π and
its distance δ from the first camera, i.e. the reference camera. Given this setup, the image
point m^1 in one camera image can directly be mapped onto the image point m^k in the second
camera image via the homography H induced by the plane Π according to


$$ m^{k} = H(\Pi, P_1, P_k) \cdot m^{1}, \quad \text{with} $$

$$ H(\Pi, P_1, P_k) = K_k \cdot \left( R - \frac{T \cdot N_\Pi^{T}}{\delta} \right) \cdot K_1^{-1}. \tag{2.7} $$


Here, K_1 and K_k denote the intrinsic matrices of both cameras, and [R T] denotes the relative
transformation matrix of the pose P_k of the second camera with respect to the pose P_1 of the
first camera. Equation (2.7) is geometrically interpreted by casting a viewing ray through
the image point m^1 and intersecting it with the scene plane Π, yielding a scene point M,
which is then projected into the second camera, resulting in the image point m^k (Hartley and
Zisserman 2004). (Ruf et al. 2021b, Sec. 2.2.)
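
The mapping via a plane-induced homography can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the plane is assumed to satisfy N_Π^T · X + δ = 0 in the coordinate frame of the first camera (the sign convention that matches Equation (2.8)), [R T] is assumed to map points from the first into the second camera frame, and the intrinsic matrices are assumed to have zero skew. All function names are hypothetical.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inv_intrinsics(K):
    """Inverse of a zero-skew pinhole matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def plane_homography(K1, Kk, R, T, normal, delta):
    """H = Kk * (R - T * n^T / delta) * K1^-1 for the plane n^T X + delta = 0."""
    inner = [[R[i][j] - T[i] * normal[j] / delta for j in range(3)]
             for i in range(3)]
    return matmul(matmul(Kk, inner), inv_intrinsics(K1))


def map_point(H, u, v):
    """Apply H to the pixel (u, v) and dehomogenize the result."""
    x, y, w = (sum(H[i][k] * p for k, p in enumerate((u, v, 1.0))) for i in range(3))
    return x / w, y / w
```

For a fronto-parallel plane with N_Π = (0 0 -1)^T at δ = 10, identical intrinsics with a focal length of 100 pixels and a pure sideways translation of one unit, the sketch maps each pixel by a horizontal shift of 100 · 1 / 10 = 10 pixels, which agrees with the disparity of a rectified stereo pair.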


Based on the relationship between two cameras and a scene plane, Collins (1996) proposed
an algorithm for true multi-image matching (cf. Algorithm 2.1). This algorithm samples the
scene-space between two bounding planes Π_max and Π_min, located at δ_max and δ_min, by sweeping
a plane along its normal vector N_Π through space and matching the pixels inside the input
images according to Equation (2.7) for each distance δ ∈ [δ_min, δ_max] of the plane relative to
the reference camera. For each position of the plane, an arbitrary number of matching images
are warped by the inverse plane-induced homography H(Π, P_ref, P_k)^-1 into the view of the reference
camera, where they are matched against the reference image. If the scene plane is close to a
three-dimensional structure, then the corresponding image regions of the warped matching
images overlap with those in the reference image, which allows the scene depth of the
corresponding object to be deduced from the parameterization of the corresponding plane. Initially denoted
as space-sweep algorithm, it was adopted by numerous studies on multi-image matching and
MVS (Gallup et al. 2007, Pollefeys et al. 2008, Sinha et al. 2014), which eventually denoted it as
plane-sweep algorithm. (Ruf et al. 2021b, Sec. 2.2.1.)


An overview of the basic plane-sweep algorithm for true multi-image matching is given
in Figure 2.3 and Algorithm 2.1. Moving from a two-camera setup to an arbitrary number
of cameras, the distinction is no longer made between left and right image, but between the
reference image I_ref and the matching images I_k \ I_ref. Moreover, when using the plane-sweep
algorithm, the disparity Δu in the image matching is substituted by the plane distance
δ. Given a plane parameterization, the depth d_p can be deduced by intersecting the viewing ray
through pixel p = (u v)^T with the corresponding plane:


$$ d_p = \frac{-\delta}{N_\Pi^{T} \cdot K_{\mathrm{ref}}^{-1} \cdot (u\ v\ 1)^{T}}. \tag{2.8} $$
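
As a small worked illustration of Equation (2.8), the following sketch intersects the viewing ray of a pixel with a plane. It assumes a zero-skew intrinsic matrix of the reference camera and a plane normal that points towards the camera, e.g. N_Π = (0 0 -1)^T for a fronto-parallel plane; the function name `depth_from_plane` is hypothetical.

```python
def depth_from_plane(K_ref, normal, delta, u, v):
    """Depth along the optical axis at which the ray through (u, v) hits the plane.

    The plane is assumed to satisfy normal^T X + delta = 0 in the reference frame.
    Returns None if the viewing ray is parallel to the plane.
    """
    fx, fy = K_ref[0][0], K_ref[1][1]
    cx, cy = K_ref[0][2], K_ref[1][2]
    # K_ref^-1 * (u, v, 1)^T for a zero-skew pinhole camera.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if denom == 0.0:
        return None
    return -delta / denom
```

For a fronto-parallel plane the denominator is -1 for every pixel, so the depth equals the plane distance δ everywhere. For slanted planes the depth varies across the image, which is why the depth has to be evaluated per pixel.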


2.2 Cost Optimization and Regularization 


Given the previously computed three-dimensional cost volume S, the next step consists of
extracting a plausible disparity map U. Since each voxel of S holds the matching cost of a
particular pixel p of the reference image, associated with a certain disparity Δu within the
studied disparity range Γ, a straightforward approach to compute a disparity map from the
cost volume is to take the winner-takes-all (WTA) solution for all pixels inside the reference
image:


$$ U(p) = \arg\min_{\Delta u \in \Gamma} S(p, \Delta u). \tag{2.9} $$


However, due to the limited descriptiveness of the cost functions and resulting ambiguities in 
the cost volume, this would lead to a noisy and unsuitable disparity map. Thus, it is important 
to perform a cost optimization and, in turn, regularization of the cost volume. (Ruf et al. 
2021a, Sec. 2.1.2) 
Scharstein and Szeliski (2002) have categorized stereo methods according to their cost
optimization strategies into local and global methods. While the first group of algorithms
only optimizes the cost volume in a locally confined window and makes implicit smoothness
assumptions, global methods explicitly state their regularization scheme and perform an
optimization within the whole image domain. Global methods thus typically produce more accurate
disparity maps than local methods. Yet, at the same time, not all
global methods are feasible for fast and embedded processing due to their complexity. (Ruf
et al. 2021a, Sec. 2.1.2.)


2.2.1 Semi-Global Matching 


In their taxonomy, Scharstein and Szeliski (2002) additionally propose a third group of al- 
gorithms, which utilize dynamic programming to compute a disparity map. Algorithms in 
this group usually state an explicit smoothness assumption and approximate a global opti- 
mization scheme, which is why this group can be considered as a subgroup of the global 
methods. The most prominent algorithm, especially for real-time processing, is the so-called 
Semi-Global Matching (SGM) algorithm (Hirschmüller 2005, Hirschmüller 2008). Here, the 
optimization scheme is formulated as a 2D Markov random field (MRF) and the minimization 
of the energy function E: 


$$ E(U) = \sum_{p} \Bigl( S(p, U(p)) + \sum_{q \in R_p} \varphi_1 \cdot \bigl[\, |U(p) - U(q)| = 1 \,\bigr] + \sum_{q \in R_p} \varphi_2 \cdot \bigl[\, |U(p) - U(q)| > 1 \,\bigr] \Bigr), \tag{2.10} $$


with an explicit smoothness assumption, which penalizes the disparity deviation within the
support region R_p of pixel p by φ_1 and φ_2, according to the magnitude of disparity differences,
with [·] denoting the Iverson bracket. As illustrated by Figure 2.4, the minimization of E is
approximated by aggregating the matching costs along a number of concentric paths for
each pixel p within the image domain:


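$$ L_r(p, \Delta u) = S(p, \Delta u) + \min_{\Delta u'} \bigl( L_r(p - r, \Delta u') + V(\Delta u, \Delta u') \bigr), \quad \text{with} $$

$$ V(\Delta u, \Delta u') = \begin{cases} 0, & \text{if } \Delta u = \Delta u' \\ \varphi_1, & \text{if } |\Delta u - \Delta u'| = 1 \\ \varphi_2, & \text{if } |\Delta u - \Delta u'| > 1. \end{cases} \tag{2.11} $$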


In L_r, for each pixel p and disparity Δu inside the cost volume S, the matching costs are
recursively summed up while moving along the path with the direction r. Within the smoothness
term V(Δu, Δu'), the matching costs of the previously considered pixel with respect to all
evaluated disparities Δu' are penalized with φ_1 or φ_2, according to the disparity difference
between Δu and Δu'. Finally, for each pixel, all path costs are summed up and stored inside
the aggregated cost volume S̄:


$$ \bar{S}(p, \Delta u) = \sum_{r} L_r(p, \Delta u), \tag{2.12} $$


before extracting the WTA disparity map according to Equation (2.9) from S̄. (Ruf et al.
2021a, Sec. 2.1.2.)


Figure 2.4: Illustration of the path aggregation within SGM to regularize the three-dimensional cost volume. For
each pixel p, the cost volume S is optimized along eight concentric paths L_r.


The use of dynamic programming, i.e. breaking down the minimization problem of the en- 
ergy function into the aggregation of independent one-dimensional paths, makes the SGM 
approach well-suited for massively parallel computing and vector processing, and in turn for 
real-time and even embedded processing. At the same time, many studies have shown that 
the results of the SGM algorithm are very accurate, making it one of the most widely used 
algorithms for real-time and accurate stereo processing. (Ruf et al. 2021a, Sec. 2.1.2.) 
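
The path aggregation of Equations (2.11) and (2.12), followed by the WTA selection of Equation (2.9), can be sketched for a single image row. This is a simplified illustration and not the optimized implementation discussed later: it only uses the two horizontal paths instead of eight, it follows Equation (2.11) as written (without the subtraction of the minimum path cost that is often added to bound the values), and the names `aggregate_path`, `smoothness`, `wta` and `sgm_scanline` are hypothetical.

```python
def smoothness(d, d_prev, phi1, phi2):
    """Smoothness term V of Equation (2.11)."""
    diff = abs(d - d_prev)
    if diff == 0:
        return 0
    return phi1 if diff == 1 else phi2


def aggregate_path(costs, phi1, phi2):
    """Aggregate costs along one 1D path; costs[i][d] is S for pixel i, disparity d."""
    num_disp = len(costs[0])
    aggregated = [list(costs[0])]  # the first pixel on the path has no predecessor
    for i in range(1, len(costs)):
        prev = aggregated[-1]
        aggregated.append([
            costs[i][d] + min(prev[dp] + smoothness(d, dp, phi1, phi2)
                              for dp in range(num_disp))
            for d in range(num_disp)
        ])
    return aggregated


def wta(costs):
    """Winner-takes-all disparity for each pixel."""
    return [min(range(len(c)), key=c.__getitem__) for c in costs]


def sgm_scanline(costs, phi1, phi2):
    """Sum the left-to-right and right-to-left path costs, then take the WTA."""
    forward = aggregate_path(costs, phi1, phi2)
    backward = aggregate_path(costs[::-1], phi1, phi2)[::-1]
    total = [[f + b for f, b in zip(fr, br)] for fr, br in zip(forward, backward)]
    return wta(total)


# A row with true disparity 1 and one ambiguous pixel in the middle.
row_costs = [[5, 0, 5], [5, 0, 5], [5, 2, 1], [5, 0, 5], [5, 0, 5]]
print(wta(row_costs))                  # [1, 1, 2, 1, 1]
print(sgm_scanline(row_costs, 3, 10))  # [1, 1, 1, 1, 1]
```

Since each path only depends on its own predecessor pixel, all rows (and, analogously, all columns and diagonals) can be aggregated independently of each other, which is the property that makes the approach suitable for parallel and vectorized processing.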


2.3 Refinement and Post-Processing 


There are a number of post-processing steps, i.e. filtering, regularization and further opti- 
mizations, which can be applied to the initial disparity map or depth map in the final stage 
of the processing pipeline. Typically implemented are the subpixel disparity or depth refine- 
ment, outlier filters, as well as geometric consistency checks for occlusion detection and final 
outlier filtering. (Ruf et al. 2021a, Sec. 2.1.3.) 
2.3.1 Disparity or Depth Refinement


The initial disparity map U computed by the SGM optimization is made up of discrete disparity
values, which is sufficient for a variety of robotic applications such as the perception of
the surroundings and obstacle detection. However, when the aim is to accurately reconstruct
the scene, it is important to also account for slanted surfaces and thus incorporate a subpixel
refinement of the disparity or the depth. A simple and yet effective way to implement such
a refinement is to use the minimum matching cost for each pixel, i.e. the matching cost
corresponding to the WTA disparity Δû, as well as the matching costs of the two neighboring
disparities directly below and above Δû, and fit a parabola through these three matching costs.
The location of the minimum of this parabola with respect to Δû is then considered as the
subpixel refinement and added to Δû. This optimization has only a minor computational
overhead and is thus well-suited for real-time processing. However, it requires
floating point arithmetic, which might be a restriction on some embedded hardware. (Ruf
et al. 2021a, Sec. 2.1.3.)
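
The parabola fit can be written in closed form. The following minimal sketch assumes that the matching costs of one pixel are available as a list indexed by disparity and that the refinement is skipped at the borders of the disparity range or if the three costs do not form a proper minimum; the function name `refine_subpixel` is hypothetical.

```python
def refine_subpixel(costs, d_wta):
    """Refine the WTA disparity by fitting a parabola through three matching costs."""
    if d_wta <= 0 or d_wta >= len(costs) - 1:
        return float(d_wta)  # no neighbor on one side, keep the discrete value
    c_minus, c_0, c_plus = costs[d_wta - 1], costs[d_wta], costs[d_wta + 1]
    curvature = c_minus - 2.0 * c_0 + c_plus
    if curvature <= 0.0:
        return float(d_wta)  # flat or degenerate cost profile, no unique minimum
    # Vertex of the parabola through (-1, c_minus), (0, c_0), (+1, c_plus).
    offset = (c_minus - c_plus) / (2.0 * curvature)
    return d_wta + offset
```

If the costs are, for example, samples of (Δu - 2.3)^2 at the integer disparities 0 to 4, the WTA disparity is 2 and the fitted offset is 0.3, so the refined disparity recovers the true minimum at 2.3.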


2.3.2 Image-Based Filtering 


Before a disparity or depth map is finalized for further processing, a simple filter is typically
applied in order to regularize the measurements in a local neighborhood or to remove
remaining outliers. In the scope of this work, a median filter (Section 2.3.2.1) as well as
a difference-of-Gaussian filter (Section 2.3.2.2) is employed.


2.3.2.1 Median Filter 


A commonly used approach for noise reduction in disparity or depth maps is the employment 
of a local median filter in order to remove small outliers within a confined support region. In 
this, for each pixel p within the disparity or depth map, all estimates inside a local support 
region around p are first sorted according to their estimated value. Then, the estimate that is 
located at the center of the sorted vector, i.e. the median, is assigned to p. The median filter 
is computationally very efficient. The computationally most demanding part is the sorting 
of the estimates, which only poses a problem when data cannot be randomly accessed and 
should be processed in order, e.g. in case the filter is optimized for vectorized processing 
on CPUs or FPGAs. However, this can be remedied when utilizing the concept of sorting 
networks as described in Section 3.3.2.4. 
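
A straightforward, non-optimized version of such a median filter is sketched below. It assumes that the disparity or depth map is given as a nested list and that border pixels, for which the support region is incomplete, are left unchanged; it uses a generic sort rather than the sorting networks mentioned above, and the function name `median_filter` is hypothetical.

```python
def median_filter(disp, radius=1):
    """Apply a (2 * radius + 1)^2 median filter to a map given as nested lists."""
    height, width = len(disp), len(disp[0])
    out = [row[:] for row in disp]  # border pixels keep their original value
    for v in range(radius, height - radius):
        for u in range(radius, width - radius):
            window = sorted(
                disp[y][x]
                for y in range(v - radius, v + radius + 1)
                for x in range(u - radius, u + radius + 1)
            )
            out[v][u] = window[len(window) // 2]  # center of the sorted vector
    return out
```

The generic sort in this sketch has a data-dependent control flow. A sorting network replaces it with a fixed sequence of compare-and-swap operations, which is what makes the filter amenable to vectorized processing.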
Algorithm 2.2: The difference-of-Gaussian filter to invalidate all image pixels belonging
to weakly-textured areas. (Ruf et al. 2021b, Alg. 3.)

Input Data: unfiltered disparity or depth map (U or D), as well as corresponding
    reference image I_ref.
Result: filtered U or D, in which all estimates corresponding to weakly-textured areas
    in I_ref are removed.

- use a Gaussian filter with a kernel of m x m pixels to smooth the reference frame I_ref,
  yielding I^smooth.
- compute the difference-of-Gaussian (DOG) image depicting local image gradients
  according to: I^DOG = I_ref - I^smooth.
- apply a binary threshold to compute the DOG mask M^DOG marking all image areas in
  which the intensity change is greater than 0.5.
- remove activation areas smaller than κ pixels in M^DOG by applying a speckle filter.
- dilate M^DOG with a kernel size of n x n pixels to fill small holes in activation areas.
- remove deactivation areas smaller than ξ pixels by applying a speckle filter to the
  inverted DOG mask M^DOG,inv = 1 - M^DOG.
- invalidate pixels in U or D for which M^DOG,inv = 1.


2.3.2.2 Difference-of-Gaussian Filter 


As proposed by Wenzel (2016), the difference-of-Gaussian (DOG) filter removes estimates
by masking out pixels in image regions that only provide little textural information (e.g.
unsharp or overexposed areas). Here, it is assumed that in such regions the image matching is
ambiguous and that it leads to less accurate results. The DOG filter is used to detect
weakly-textured areas inside the reference image I_ref and construct a binary image mask, which, in
turn, is used to remove the estimates from the corresponding maps. Algorithm 2.2 provides
an overview of the implementation of the DOG filter used in the scope of this work, which
is similar to the one proposed in (Wenzel 2016). (Ruf et al. 2021b, Sec. 2.6.1)
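
The core of the DOG filter, i.e. the smoothing, differencing and thresholding steps of Algorithm 2.2, can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the source: a fixed 3 x 3 binomial kernel stands in for the Gaussian filter, image borders are handled by replicating the border pixels, the absolute value of the difference is thresholded, and the speckle filtering and dilation steps are omitted. All function names are hypothetical.

```python
# 3 x 3 binomial kernel as a small approximation of a Gaussian (weights sum to 16).
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def smooth(image):
    """Smooth an image (nested lists) with the binomial kernel, replicating borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            acc = 0.0
            for dv in (-1, 0, 1):
                for du in (-1, 0, 1):
                    y = min(max(v + dv, 0), height - 1)
                    x = min(max(u + du, 0), width - 1)
                    acc += KERNEL[dv + 1][du + 1] * image[y][x]
            out[v][u] = acc / 16.0
    return out


def dog_mask(image, threshold=0.5):
    """Binary mask that is 1 where the image differs from its smoothed version."""
    smoothed = smooth(image)
    return [[1 if abs(a - b) > threshold else 0 for a, b in zip(row, srow)]
            for row, srow in zip(image, smoothed)]


def invalidate_weak_texture(estimates, mask, invalid=0):
    """Remove all estimates that lie outside the textured areas of the mask."""
    return [[e if m else invalid for e, m in zip(erow, mrow)]
            for erow, mrow in zip(estimates, mask)]
```

In a complete implementation, the mask would additionally be cleaned by the speckle filters and the dilation listed in Algorithm 2.2 before it is applied to the disparity or depth map.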


2.3.3 Geometric Consistency Check and Occlusion Detection 


Given two or more disparity or depth maps, a consistency check can be performed in order 
to filter out estimates that are geometrically inconsistent or occluded, and thus not reliable. 
Section 2.3.3.1 gives an overview on the left-right consistency check, which is typically em- 
ployed when estimating a disparity map from a two-view stereo setup and allows to detect 
occluded areas. In Section 2.3.3.2, a geometric consistency check is introduced, which relies 
on multiple depth maps from different views and allows to filter estimates which are incon- 
sistent between the different views or between different time steps, if an image sequence is 
considered as input. 


2.3.3.1 Left-Right Consistency Check 


The existence of occluded pixels that arise from areas which are observed by one camera and
yet occluded in the field-of-view of the other camera is inherent to disparity and depth maps
that are computed from a conventional stereo setup consisting of only two cameras. A typical
approach to detect and filter such pixels is the left-right consistency check. As illustrated
by Figure 2.5, the disparities that are stored in the disparity map of the left image U^L are
compared with the corresponding disparities in the right disparity map U^R and invalidated
if they differ by more than a certain threshold, usually 1:


$$ U(p) = \begin{cases} U^{L}(p), & \text{if } |U^{L}(p) - U^{R}(p_u - U^{L}(p), p_v)| \leq 1 \\ 0, & \text{otherwise.} \end{cases} \tag{2.13} $$


(Ruf et al. 2021a, Appendix B.)


While the occlusion detection using the left-right consistency check is very effective, it
requires the computation of a second disparity map U^R, which corresponds to the right image
of the stereo pair. A straightforward approach to compute U^R would be to swap and
horizontally flip the input images and repeat the image matching, cost optimization and disparity
computation as described above. This, however, would mean executing the first and
computationally most expensive steps of the processing pipeline twice for each image pair. Yet, the
computation of U^R can be efficiently approximated by reusing the aggregated cost volume
S̄ from the cost optimization step:


$$ U^{R}(p) = \arg\min_{\Delta u \in \Gamma} \bar{S}((p_u + \Delta u, p_v), \Delta u). \tag{2.14} $$


(Ruf et al. 2021a, Sec. 2.1.3.) 
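
Equations (2.13) and (2.14) can be sketched for a single image row. The sketch assumes that the aggregated costs of one row are stored as `agg_row[u][d]` for the left image column u and disparity d, that disparities pointing outside of the image are skipped, and that the value 0 marks invalid pixels as in Equation (2.13), which means that it cannot be distinguished from a valid disparity of 0. The function names are hypothetical.

```python
INVALID = 0  # as in Equation (2.13); collides with a valid disparity of 0


def right_disparity_row(agg_row):
    """Approximate the right disparity map from the left aggregated cost volume."""
    width, num_disp = len(agg_row), len(agg_row[0])
    result = []
    for u in range(width):
        # Only disparities whose left image column u + d lies inside the image.
        candidates = [d for d in range(num_disp) if u + d < width]
        result.append(min(candidates, key=lambda d: agg_row[u + d][d]))
    return result


def left_right_check_row(disp_left, disp_right, threshold=1):
    """Invalidate left disparities that are not confirmed by the right disparity map."""
    checked = []
    for u, d in enumerate(disp_left):
        u_right = u - d  # column of the corresponding pixel in the right image
        if 0 <= u_right < len(disp_right) and abs(d - disp_right[u_right]) <= threshold:
            checked.append(d)
        else:
            checked.append(INVALID)
    return checked
```

Reusing the aggregated cost volume avoids a second matching and aggregation pass. The right disparity map obtained in this way is only an approximation, since the aggregation was performed with respect to the left image.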


Figure 2.5: Illustration of the left-right consistency check to find occluded image regions when estimating a disparity
map from a two-view stereo setup. Given the disparity maps U^L and U^R for the reference and matching
image, respectively, a consistency check according to Equation (2.13) can be performed in order to find
occlusions. (Ruf et al. 2021a, Fig. A2.)


2.3.3.2 Multi-View Consistency Check 


When multiple depth maps D_k with corresponding projection matrices P_k are available, for
example when performing reconstruction by MVS or when considering an image sequence as
input and a temporal consistency is to be established, a geometric consistency check can be
performed by relying on the mutual reprojection error. In this, each pixel p^ref of a selected
reference depth map D_ref, having a depth estimate d^ref, is projected into the view of another depth
map D_k according to d^ref and the corresponding projection matrices P_ref and P_k, yielding
the image point p^k. Given p^k and the corresponding depth d^k from D_k, the image point
p^k is projected back into the view of D_ref, resulting in p'^ref. Finally, if the Euclidean distance
between p^ref and p'^ref exceeds a given threshold η_r, the estimate at p^ref is invalidated. (Ruf
et al. 2021b, Sec. 2.6.2.)


Schönberger et al. (2016) formulate this reprojection error for pixel p in a reference view and
a neighboring map k as e^k(p) = |p - H^k · H_k · p|, with H_k denoting the forward projection
into the view k according to d^ref, and H^k representing the corresponding backward projection
according to d^k (cf. Equation (2.7)). In both cases, a fronto-parallel plane orientation with
N_Π = (0 0 -1)^T is assumed. Apart from thresholding the reprojection error e^k, the number of
neighboring views for which the reprojection error is within the threshold, i.e. the number of
hits, is counted: e_h(p) = Σ_k [e^k(p) < η_r], with [·] being the Iverson bracket. This approach
can be adopted to perform a geometry-based filtering between a number of depth maps within
a sliding window Ψ. Algorithm 2.3 gives an overview of the geometric consistency check and the
corresponding filtering of the depth map. (Ruf et al. 2021b, Sec. 2.6.2.)


Algorithm 2.3: The geometric consistency check for multiple depth maps from different
views. (Ruf et al. 2021b, Alg. 4.)

Input Data: depth maps D_k within a sliding window Ψ, as well as corresponding
    projection matrices P_k.
Result: filtered D_ref of the reference view, in which all estimates that are geometrically
    not consistent are removed.

- select D_ref corresponding to the center-most view within the sliding window Ψ.
foreach pixel p^ref ∈ D_ref and neighboring view k ∈ Ψ do
    - calculate the number of hits for which the reprojection error is below a threshold
      η_r:
          e_h(p) = Σ_k [e^k(p) < η_r], with e^k(p) = |p - H^k · H_k · p|.
    - if the hit-count is below a threshold η_h, i.e. e_h(p) < η_h, invalidate pixel p in D_ref by
      setting it to 0.
end
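
The hit counting of Algorithm 2.3 can be sketched as follows. Instead of the plane-induced homographies H_k and H^k, the sketch uses an explicit unprojection and projection of each pixel, which is geometrically equivalent for a given depth. It further assumes, purely for illustration, that all views share the same zero-skew intrinsics K = (fx, fy, cx, cy), that the reference camera frame serves as world frame, that each neighboring view is given as a tuple (R, t, depth map) mapping reference coordinates into that view, and that the depth in view k is looked up at the nearest pixel. All names are hypothetical.

```python
import math


def unproject(K, u, v, depth):
    fx, fy, cx, cy = K
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def project(K, X):
    fx, fy, cx, cy = K
    return fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy


def to_view(R, t, X):
    """Reference frame -> frame of view k."""
    return [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]


def from_view(R, t, X):
    """Frame of view k -> reference frame (inverse rigid transformation)."""
    return [sum(R[j][i] * (X[j] - t[j]) for j in range(3)) for i in range(3)]


def reprojection_error(K, u, v, d_ref, view):
    """Error e^k(p) in pixels, or None if p cannot be checked against view k."""
    R, t, depth_k = view
    X_k = to_view(R, t, unproject(K, u, v, d_ref))
    if X_k[2] <= 0:
        return None
    u_k, v_k = project(K, X_k)
    col, row = round(u_k), round(v_k)
    if not (0 <= row < len(depth_k) and 0 <= col < len(depth_k[0])):
        return None
    d_k = depth_k[row][col]
    if d_k <= 0:
        return None  # no valid estimate in view k
    u_b, v_b = project(K, from_view(R, t, unproject(K, u_k, v_k, d_k)))
    return math.hypot(u - u_b, v - v_b)


def filter_reference(K, depth_ref, views, eta_r, eta_h):
    """Set estimates of depth_ref to 0 if fewer than eta_h views confirm them."""
    out = [row[:] for row in depth_ref]
    for v, row in enumerate(depth_ref):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            errors = (reprojection_error(K, u, v, d, view) for view in views)
            hits = sum(1 for e in errors if e is not None and e < eta_r)
            if hits < eta_h:
                out[v][u] = 0
    return out
```

Views in which a pixel is not visible do not contribute a hit in this sketch, so pixels near the image border are confirmed by fewer views. Whether such pixels should be kept is governed by the choice of the hit-count threshold η_h.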
3 Real-Time Stereo Disparity Estimation on Embedded On-Board Devices


This chapter includes material from the following publication: 


Ruf, B.; Mohrs, J.; Weinmann, M.; Hinz, S., and Beyerer, J. (2021a): “ReS²tAC - UAV-borne real-time SGM
stereo optimized for embedded ARM and CUDA devices”. In: MDPI Sensors 21.11, 3938. Reprinted with
permission. It is cited as (Ruf et al. 2021a) and marked with a blue sidebar.


As outlined in Section 1.2, the first part of the proposed framework aims at facilitating a safe
and autonomous use of COTS UAVs by means of real-time on-board obstacle detection and
collision avoidance. In order to perceive their surroundings, modern UAVs are often equipped
with a range of different sensors, e.g. ultrasonic and stereo vision sensors. In the scope of
this work, the stereo vision sensor is used to compute a dense disparity map that serves as
an input to the algorithm for obstacle detection and collision avoidance (cf. Section 6.1).
Compared to active sensors like LiDAR scanners, camera systems in combination with
state-of-the-art algorithms are typically more practical in performing these tasks, especially in
terms of costs, weight and power consumption. Moreover, such stereo vision sensors are often
already integrated in COTS UAVs. On the other hand, while a LiDAR sensor directly provides
data on the 3D geometry of the scene, using a stereo camera for the same task requires
processing the stereo image data and performing a disparity or depth estimation (Nex and Rinaudo
2011). This, in turn, requires high-performance embedded processing on board the UAV.
(Ruf et al. 2021a, Sec. 1.)


For a long time, FPGAs were the only processing hardware capable of
high-performance computing while at the same time maintaining the low power consumption
that is essential for embedded systems. In recent years, however, the availability of embedded GPUs,
such as the NVIDIA Tegra, has allowed for massively parallel embedded computing on graphics
hardware, which is typically more flexible than FPGAs and less cumbersome to program.
Furthermore, with the increasing use of deep learning for a wide range of applications, the
importance and availability of embedded GPUs have grown even more. With the Jetson boards,
comprised of an embedded ARM CPU and an embedded Tegra GPU, NVIDIA provides a suitable
alternative to FPGAs for embedded high-performance computing, especially since these
systems have recently also been integrated in COTS UAVs, such as the DJI Matrice series in
combination with the DJI Manifold or the UVify IFO-S UAV. (Ruf et al. 2021a, Sec. 1.)


3.1 Contributions and Outline 


In this chapter, an approach for real-time embedded stereo processing on ARM- and
CUDA-enabled devices is presented. It is based on the well-known and widely used SGM algorithm
(cf. Section 2.2.1) first proposed by Hirschmüller (2005) and Hirschmüller (2008), and it is
further denoted as Real-Time SGM Stereo Optimized for Embedded ARM and CUDA Devices
(ReS²tAC). Its main contributions are:


• the optimization of the SGM algorithm for embedded CUDA GPUs, such as the
NVIDIA Tegra, by utilizing massively parallel computing,


• the use of the NEON intrinsics to optimize the algorithm for vectorized SIMD
processing on embedded ARM CPUs, and


• the deployment on the DJI Manifold 2-G attached to a DJI Matrice 210v2 RTK UAV
and a use-case-specific evaluation with respect to accuracy, processing speed and
power consumption.


Even though ReS²tAC is deployed and tested for real-time processing on board a UAV, it is
also suitable for other embedded systems, such as those deployed on ground-based robots or
those used in an advanced driver assistance system (ADAS). (Ruf et al. 2021a, Sec. 1.)


The contributions presented in this chapter have been published in:


• Ruf, B.; Mohrs, J.; Weinmann, M.; Hinz, S., and Beyerer, J. (2021a): “ReS²tAC - UAV-borne
real-time SGM stereo optimized for embedded ARM and CUDA devices”. In:
MDPI Sensors 21.11, 3938. Peer-reviewed on the basis of the full paper. Cited as (Ruf
et al. 2021a).


In the following, the related work on embedded stereo processing utilizing embedded FPGA,
GPU or CPU hardware is briefly summarized in Section 3.2. Here, it is also pointed out
how ReS²tAC differs from the approaches found in literature. In Section 3.3, the processing
pipeline of ReS²tAC is outlined with detailed descriptions of the optimization for parallel
processing on CUDA-enabled GPUs and of the vectorized SIMD processing with NEON
intrinsics for ARM CPUs in Section 3.3.1 and Section 3.3.2, respectively. The experimental
results are presented and discussed in Section 3.4 and Section 3.5, respectively. A summary
and concluding remarks are finally provided in Section 3.6.


3.2 Related Work 


The following sections summarize the related work on embedded stereo processing.
Section 3.2.1 looks into studies that have deployed stereo algorithms on FPGA hardware. This is
followed by an overview of the emergence of embedded GPU hardware for real-time stereo
processing in Section 3.2.2. Lastly, in Section 3.2.3, the related work on deploying real-time
stereo processing on CPU hardware, both for high-end desktop and embedded environments,
is discussed. In addition, it is pointed out how ReS²tAC differs from the related work
utilizing embedded GPU and CPU hardware for real-time stereo processing. A relevant excerpt of
the related work on real-time SGM stereo processing on FPGA, GPU and CPU hardware is
summarized in Table 3.1. (Ruf et al. 2021a, Sec. 1.2.)


3.2.1 Embedded Stereo Processing on FPGAs 


The use of FPGAs is key to achieving high-performance image processing with minimal power
consumption, especially when relying on computationally expensive algorithms. Thus, most
implementations of stereo algorithms for embedded systems, in particular of the SGM
algorithm (Hirschmüller 2005, Hirschmüller 2008), are based on FPGA technology. First
optimizations of the SGM algorithm, such as those presented in (Gehrig et al. 2009, Banz et al.
2010), were deployed on a PCIe-FPGA card inside a conventional PC or on a separate
development kit, achieving real-time frame rates of 27 FPS and 30 FPS on low-resolution imagery,
i.e. images with a size of 320 x 240 pixels and 640 x 480 pixels, respectively. Due to
ongoing technological advancements, the implementation of Wang et al. (2015), deployed on an
Altera Stratix-IV FPGA-Board, already achieved a frame rate of 67 FPS on images with a size
of 1024 x 768 pixels in 2015. However, besides the dedicated and specialized processing of
a specific task, typical characteristics of embedded systems are a small form-factor and the
integration in larger systems or cooperative environments. (Ruf et al. 2021a, Sec. 1.2.1.)


In their work, Schmid et al. (2013) have deployed the implementation of Gehrig and Rabe
(2010) on a small rotor-based UAV for stereo-vision-based navigation, achieving 14.6 FPS on a
Spartan 6 FPGA. Further system-on-a-chip (SOC) developments with respect to size and
performance made it possible to deploy computationally expensive algorithms on increasingly smaller
systems with higher performance. Honegger et al. (2014) implemented the SGM algorithm
on a small SO-DIMM sized SOC equipped with a Xilinx Artix7 FPGA, reaching 60 FPS
with a frame size of 753 x 480 pixels. By reducing the frame size to 320 x 240 pixels, the
implementation of Barry et al. (2015) reached 120 FPS and was used to navigate a small and
fast flying fixed-wing UAV around obstacles. (Ruf et al. 2021a, Sec. 1.2.1.)


A number of recent studies (Hofmann et al. 2016, Rahnama et al. 2018a, Zhao et al. 2020) have
shown that further optimizations, such as reducing the number of processing paths in the
SGM optimization or increasing parallelization by splitting the input images into independent
stripes, as well as technological advancements, make it possible to reach frame rates of over 100 FPS,
while at the same time increasing the accuracy of the stereo algorithm by using a higher
image resolution and reducing the form-factor, which leads to a reduced power consumption of the
SOC. Yet, the use of FPGAs for real-time embedded image processing involves a cumbersome
and time-consuming development, optimization and deployment process. In order to reduce
the development costs of such systems, substantial effort is being made to enhance the process of
high-level synthesis (HLS) and, in turn, facilitate the development of algorithms for FPGAs with
higher-level languages such as C/C++ (Ruf et al. 2018a, Kalb et al. 2019, Zhao et al. 2020).
(Ruf et al. 2021a, Sec. 1.2.1.)


3.2.2 On the Emergence of Embedded Processing on GPUs 


The development cycles for implementing and optimizing image processing algorithms for
massively parallel processing on GPUs, on the other hand, are much shorter and thus less
expensive. In addition, GPUs provide a much higher processing power, ideal for algorithms
with high computational effort, such as stereo image processing. Early works (Rosenberg
et al. 2006, Ernst and Hirschmüller 2008) have utilized the rendering pipeline of OpenGL to
deploy the SGM stereo algorithm on graphics hardware and reached frame rates of up to
8 FPS and 4 FPS on VGA image resolution, respectively. With the introduction of the CUDA
API in 2007, the development costs for general-purpose computation on a GPU (GPGPU)
have dropped even more. Consequently, many implementations of the SGM algorithm for
real-time stereo processing on hardware without embedded constraints have been optimized and
deployed on graphics hardware (Banz et al. 2011, Michael et al. 2013, Hernandez-Juarez et al.
2016). Hernandez-Juarez et al. (2016) show that, with increasing computational power, the
use of GPGPU on modern, high-performing graphics hardware, such as the NVIDIA Titan X,
makes it possible to reach frame rates of up to 237 FPS on VGA image resolution with the conventional
use of eight optimization paths inside the SGM optimization. Even higher frame rates of up
to 475 FPS and 886 FPS are possible if the number of optimization paths is reduced to four
or two, respectively. (Ruf et al. 2021a, Sec. 1.2.2.)


While GPUs provide great computational performance, a major drawback is their
high power consumption. The deployment of the SGM algorithm on a high-end GPU only
achieves 1.90 FPS/W on VGA image resolution (Hernandez-Juarez et al. 2016), while a
comparable configuration of the algorithm deployed on a state-of-the-art embedded FPGA achieves
15 FPS/W on a larger image resolution (Zhao et al. 2020). However, due to the increasing
importance of deep learning algorithms for robotic applications, more and more embedded
CPU-GPU-based SOCs are released and integrated into robotic systems. Recently, such SOCs
with embedded GPUs have even been integrated into COTS UAVs, and thus been made
available to the mainstream user. In their work, Hernandez-Juarez et al. (2016) have also deployed
their implementation on the NVIDIA Jetson TX1, encapsulating the embedded Tegra X1 GPU,
reaching 42 FPS (4.19 FPS/W) on VGA resolution and a 4-path SGM optimization. Chang et al.
(2020) further optimized the computation of the normalized cross correlation (NCC)
matching cost for the use on GPUs and deployed their SGM-based stereo algorithm on the NVIDIA
Jetson TX2, reaching 28 FPS on images with a size of 1242 x 375 pixels. (Ruf et al. 2021a,
Sec. 1.2.2.)


In the scope of ReS²tAC, the SGM algorithm is also optimized for real-time processing on
CUDA devices and deployed on the NVIDIA Jetson TX2 and the more powerful NVIDIA
Jetson Xavier AGX. In this, different configurations and optimization strategies
are evaluated with respect to performance and power consumption. The computationally
most expensive part of the SGM algorithm is the aggregation along the different scanlines.
At the same time, due to the nature of dynamic programming, this is also the part which
can be parallelized most effectively, since the computation of each scanline can be done fully
independently, without the need for synchronization. In terms of GPGPU, this is typically
done by instantiating one thread on the graphics hardware for each scanline, resulting in a
massively parallel processing of the SGM path aggregation. This is also adopted by ReS²tAC
as described in Section 3.3.1. (Ruf et al. 2021a, Sec. 1.2.2.)


3.2.3 Are Embedded CPUs Suitable for Stereo Processing? 


In contrast to FPGAs and GPUs, which are designed to be less flexible and yet very powerful
in processing specific tasks on a large amount of data, CPUs are designed to do more general
and versatile processing, needed to allow computers to instantly react to new sensor input.
Even though they have much higher clock frequencies, CPUs are often not able to keep
up with the performance achieved by their more specialized counterparts, due to their small
number of cores and, in turn, limited ability to parallelize. (Ruf et al. 2021a, Sec. 1.2.3.)


One of the first deployments of the SGM algorithm on a conventional CPU was done by 
Gehrig and Rabe (2010). They have implemented a number of different parallelization tech- 
niques, among others splitting the 8-path optimization scheme into two independent scans. 
With this, they achieved 14 FPS on images with a size of 640 × 320 pixels, but they only
considered a range of 16 disparities. They have deployed their algorithm on an Intel Core i7 with
four cores and a clock frequency of 3.3 GHz. A few years later, Spangenberg et al. (2014) have 
achieved 16 FPS on VGA resolution and 128 disparities, also running their implementation on 
a conventional Intel Core i7 with four cores. Apart from a number of algorithmic optimiza- 
tions, e.g. disparity space compression and striped computation, they have parallelized the 
processing by utilizing single-instruction-multiple-data (SIMD) vectorization with the SSE 
instruction set from Intel combined with multi-threading. (Ruf et al. 2021a, Sec. 1.2.3.) 


The work of Arndt et al. (2013) is one of the first to deploy an implementation of the SGM 
algorithm on an embedded CPU, namely the Freescale P4080, reaching a frame rate of only 
0.5 FPS on VGA image resolution. When considering that this was done around the same time 
as the works of Gehrig and Rabe (2010) and Spangenberg et al. (2014), it illustrates the gap 
between conventional and embedded CPUs in terms of performance. However, embedded CPU
technology has also evolved and gained performance. Most modern high-end embedded SoCs
consist of an ARM CPU and an FPGA or a GPU, for example the Xilinx Ultrascale series or the
NVIDIA Jetson series. Rahnama et al. (2018b) implemented the ELAS stereo algorithm (Geiger
et al. 2011) on a Xilinx ZC706 SoC made up of an ARM CPU and an FPGA. In this, they have
deployed the computationally most expensive stages of the algorithm onto the FPGA (if possible)
and used the ARM CPU to process the stages with unpredictable memory access patterns. Saidi
et al. (2020) accelerated a simple stereo algorithm on an ARM CPU, achieving frame rates of up
to 59 FPS on an image resolution of 320 × 240 pixels. In this, they have parallelized the
algorithm by using multi-threading as well as SIMD vectorization on ARM with the NEON
instruction set (ARM 2013). (Ruf et al. 2021a, Sec. 1.2.3.)


Just as the dynamic programming in the SGM path aggregation is well-suited for massively 
parallel computing on graphics hardware, its computation can also be effectively vectorized 
by SIMD processing as Spangenberg et al. (2014) have shown. With this in mind, this work
investigates the ability of embedded CPUs to perform high-accuracy stereo processing in
real-time based on the SGM algorithm. With ReS²tAC, this work is, so far, the first to
implement and optimize the Semi-Global Matching stereo algorithm on an embedded ARM CPU,
leveraging multi-threading parallelization and SIMD vectorization with NEON intrinsics.
(Ruf et al. 2021a, Sec. 1.2.3.)


3.3 ReS²tAC - Real-Time SGM Stereo Optimized for
Embedded ARM and CUDA Devices


As illustrated in Figure 3.1, the processing pipeline of ReS²tAC is divided into three
consecutive steps, which in turn are made of smaller building blocks. It follows the general
pipeline for two-view stereo disparity and depth estimation as outlined by Figure 2.2. Given the
images of a calibrated stereo camera system, it first rectifies the two input images according
to the transformation maps that are pre-computed in the scope of calibrating the stereo setup.
With the two rectified input images, dense image matching (DIM) and cost computation are
performed (cf. Section 2.1), followed by a cost optimization by means of SGM (cf. Section 2.2.1).
Lastly, disparity refinement, left-right consistency check and median filtering are executed
in a post-processing step (cf. Section 2.3) to produce the final output. In the following, detailed
descriptions of the optimization of this processing pipeline for GPGPU on CUDA-enabled
GPUs and for the vectorized SIMD processing with NEON intrinsics on ARM CPUs are given
in Section 3.3.1 and Section 3.3.2, respectively.


3.3.1 Real-Time Processing by Massively Parallel Computing on 
CUDA-Enabled GPUs 


Graphic processing units are designed for massively parallel processing. The number of pro- 
cessing units that are integrated into modern GPU hardware exceeds the number of cores 
available on a conventional CPU by far. Even embedded GPUs, like the one built into the 
NVIDIA Jetson Xavier AGX, have up to 100x more cores than high-end desktop CPUs. How- 
ever, the processing units on a GPU are less powerful and flexible than those of a CPU. While 
the CPU is designed to do arbitrary processing tasks, the cores of the GPUs are intended 
for the simultaneous and parallel processing of small and dedicated instructions on a large 
amount of data. (Ruf et al. 2021a, Appendix C.) 


Massively parallel general-purpose processing on NVIDIA GPUs is facilitated by the CUDA
API, which allows the deployment of routines and functions for data processing in the form
of so-called CUDA kernels on the GPU. When deployed, such a kernel is instantiated inside a
high number of threads, which are distributed among the available processing units and
each process a different subset of the data. For a better abstraction and handling, the
threads are logically grouped into thread blocks, which share a local memory space and can
thus exchange data between each other. Furthermore, the execution of all threads within one
thread block can be halted and synchronized. All thread blocks are grouped into a grid. The
grid size and the number of threads inside a thread block are used to parameterize the
instantiation of the kernel. The actual execution of the threads on a processing unit is always
done in groups of 32, which are referred to as CUDA warps. For efficient GPGPU, the developer
is compelled to consider some design guidelines:


• Reduction of global memory access due to its higher latency. Instead, data which gets
processed multiple times by threads in a thread block should be cached inside the
shared memory space.


• Pooling of global memory access and reduction of non-contiguous data storage.


• Efficient and maximum utilization of hardware resources.
(Ruf et al. 2021a, Appendix C.)


In the scope of ReS²tAC, each step of the stereo algorithm (cf. Figure 3.1) is implemented
in a separate CUDA kernel in order to optimize the stereo algorithm for embedded NVIDIA
GPUs. Since each kernel execution aims to achieve a high utilization of the GPU, a parallel
execution of the CUDA kernels with CUDA streams is not implemented. The following
sections provide a detailed description of how each step of the algorithm is optimized and
instantiated for efficient GPGPU. (Ruf et al. 2021a, Sec. 2.2.)


3.3.1.1 Matching Cost Computation 


As already mentioned in Section 2.1, the two most commonly used matching cost functions
are the Hamming distance of the census transform (CT) as well as an inverted and truncated
version of the NCC, both of which are also used by ReS²tAC. A detailed description of the
implementation and the optimization of these two cost functions for execution on a GPU with
CUDA is given in the following. (Ruf et al. 2021a, Sec. 2.2.1.)


The Census Transformation and its Hamming Distance 


Figure 3.2: Top: To calculate the CT by a CUDA warp for a specific image region, the image data of this region is
first copied from the global memory to the shared memory, the latter having higher access speeds. In the
second step, each thread of the CUDA warp calculates the CT for one pixel inside this region. Bottom:
The threads of a CUDA warp calculate the Hamming distance for 16 disparities simultaneously. (Ruf
et al. 2021a, Fig. 2.)


For the parallel calculation of the CT on the GPU, different image regions are assigned to each
instantiation of the corresponding CUDA kernel. Before the actual CT is calculated, each kernel
instantiation copies the pixel data of the considered image region into shared memory in
order to achieve a higher access speed. The computation of the CT is then performed in parallel
by each thread of the thread block for a specific pixel in the assigned image region. To
account for pixels at the image border, where a part of the neighborhood lies outside the image,
a zero-valued margin with half the size of the neighborhood is added to the image for
the calculation of the CT. Furthermore, the calculation of the Hamming distance is separated
from the calculation of the CT, since the kernels of these two steps are instantiated with
different parameters. For the parallelization on the GPU, the CT is implemented with
neighborhood sizes of 5 × 5 pixels and 9 × 7 pixels, the latter being the largest neighborhood
that fits into a 64-bit integer. The choice of the former neighborhood size is justified by the
limitations of the optimized implementation for the CPU (cf. Section 3.3.2.1). An overview
of the implementation is illustrated by Figure 3.2 (top). (Ruf et al. 2021a, Sec. 2.2.1.)
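The census transform with a zero-valued margin can be illustrated by the following minimal, sequential Python sketch. It is not the CUDA kernel of this work: the function name `census_transform`, the bit order and the convention that a bit is set when the neighbor is darker than the center pixel are assumptions made for illustration, and the comparison of the center pixel with itself is skipped here as in the CPU variant of Section 3.3.2.1.

```python
def census_transform(image, window=5):
    """Return one census descriptor (as int) per pixel of a 2D list of grey values."""
    half = window // 2
    height, width = len(image), len(image[0])

    # Zero-valued margin with half the size of the neighborhood around the image.
    padded = [[0] * (width + 2 * half) for _ in range(height + 2 * half)]
    for v in range(height):
        for u in range(width):
            padded[v + half][u + half] = image[v][u]

    descriptors = [[0] * width for _ in range(height)]
    for v in range(height):          # on the GPU, each (u, v) would be one thread
        for u in range(width):
            center = padded[v + half][u + half]
            bits = 0
            for dv in range(-half, half + 1):
                for du in range(-half, half + 1):
                    if dv == 0 and du == 0:
                        continue     # assumption: no self-comparison of the center
                    neighbor = padded[v + half + dv][u + half + du]
                    # Assumed convention: bit is 1 if the neighbor is darker.
                    bits = (bits << 1) | (1 if neighbor < center else 0)
            descriptors[v][u] = bits
    return descriptors
```

With a 5 × 5 window, each descriptor consists of 24 bits; a pixel whose whole neighborhood has the same value as the center yields the descriptor 0.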


In the calculation of the Hamming distance, the reference and matching image are divided
into stripes and each stripe is assigned to a different CUDA warp. Each thread of a CUDA
warp then calculates the Hamming distance from the corresponding census descriptors at a
certain pixel position and disparity. The dimensions of the CUDA warps are chosen in such a
way that for 16 different pixels half the disparities can be calculated simultaneously (Figure 3.2
(bottom)). In this, the census descriptors of the reference pixels are first loaded by all threads
of a thread block into the shared memory. Then, all threads load the census descriptors of
the matching pixels given a certain disparity. This means that the 32 threads of a thread block
load the census descriptors $CT(I_R(u - i, v)), \ldots, CT(I_R(u - i - 31, v))$ from the matching
image into the shared memory. This is repeated until the census descriptors of the matching
image are loaded into the shared memory for all disparities $[0, \Delta u_{\max}]$. Given all the census
descriptors, the threads of a thread block compute the Hamming distance simultaneously for
different pixels and store the result at the corresponding position in the cost volume. Since
the matching costs of the different disparities for one pixel lie directly next to each other in
the cost volume, the dimension of the CUDA warp is chosen in such a way that the memory
access from the GPU can be pooled together. (Ruf et al. 2021a, Sec. 2.2.1.)
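The memory layout described above, in which the costs of all disparities of one pixel are adjacent, can be sketched as follows. This is a sequential Python illustration under stated assumptions, not the CUDA implementation: the function name `hamming_cost_volume`, the flat indexing scheme and the cost `0xFF` for matches outside the image are chosen for this example only.

```python
def hamming_cost_volume(census_ref, census_match, max_disparity, invalid_cost=0xFF):
    """Fill a flat cost volume with Hamming distances of census descriptors."""
    height, width = len(census_ref), len(census_ref[0])
    num_disp = max_disparity + 1
    # Flat layout: index = (v * width + u) * num_disp + d, i.e. the costs of
    # all disparities of one pixel lie directly next to each other.
    volume = [invalid_cost] * (height * width * num_disp)
    for v in range(height):
        for u in range(width):
            base = (v * width + u) * num_disp
            for d in range(num_disp):
                if u - d < 0:
                    continue  # matching pixel lies outside the image
                xor = census_ref[v][u] ^ census_match[v][u - d]
                volume[base + d] = bin(xor).count("1")  # number of differing bits
    return volume
```

Because consecutive disparities map to consecutive addresses, threads that work on neighboring disparities of the same pixel write to a contiguous memory area.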


Inverted and Truncated Version of the Normalized Cross Correlation 


Using the NCC as a cost function is computationally more expensive than relying on the 
Hamming distance of the CT. While the computation of the CT and the subsequent Hamming 
distance only requires some comparative and bit-level operations, the computation of the 
NCC needs the calculation of a mean and variance of the two input patches. Since the mean 
and the variance of all possible patches inside an image can be precomputed and then reused, 
the calculation of the NCC and the process of image matching is divided into two separate 
stages, similar to the calculation of the CT and the Hamming distance. (Ruf et al. 2021a, 
Sec. 2.2.1.) 


Figure 3.3: Illustration of the SGM path aggregation optimized for massively parallel processing on CUDA-enabled
GPUs. Multiple CUDA blocks process the SGM path aggregation. Each block calculates 16 different
lines along one path direction. If a diagonal line reaches the image border, the values are reset and the
calculation is resumed on the other side of the image. (Ruf et al. 2021a, Fig. 3.)


In the first step, the mean and variance for all patches in the left and right input image are
calculated. In this, a kernel with the same configuration as for the calculation of the CT is
instantiated, which iterates over all pixels in the left and right image and computes the necessary
data for a patch of a given size, centered around the current pixel. Similar to the process
of computing the CT, the input images are thus first enhanced with the independent patch
information, storing the patch mean and patch variance together with the pixel value of the
center pixel in a dedicated struct for each pixel of the input image. Just as when calculating
the Hamming distance of the census transform, the pixel and the patch data are then used in the
second stage to perform the image matching based on the inverted and truncated normalized
cross correlation and to fill the resulting cost volume. Again, a kernel with the same
parameters as for the image matching with the Hamming distance is instantiated. The NCC
is implemented for patch sizes of 5 × 5 pixels and 9 × 9 pixels. (Ruf et al. 2021a, Sec. 2.2.1.)
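The two-stage structure can be sketched in Python as shown below. The sketch makes several assumptions that are not taken from this work: the helper names are hypothetical, only interior pixels with a complete patch are handled, and the inverted and truncated NCC is assumed to be 1 - max(0, NCC); the exact definition used in this work is the one given in Section 2.1.

```python
import math

def patch_values(image, u, v, half):
    """Collect the grey values of the square patch centered at (u, v)."""
    return [image[y][x] for y in range(v - half, v + half + 1)
                        for x in range(u - half, u + half + 1)]

def patch_statistics(image, half):
    """Stage 1: mean and variance of the patch around every interior pixel."""
    height, width = len(image), len(image[0])
    stats = {}
    for v in range(half, height - half):
        for u in range(half, width - half):
            values = patch_values(image, u, v, half)
            mean = sum(values) / len(values)
            variance = sum((x - mean) ** 2 for x in values) / len(values)
            stats[(u, v)] = (mean, variance)
    return stats

def ncc_cost(ref, match, ref_stats, match_stats, u, v, d, half, eps=1e-9):
    """Stage 2: matching cost from the precomputed statistics (assumed form)."""
    mean_r, var_r = ref_stats[(u, v)]
    mean_m, var_m = match_stats[(u - d, v)]
    a = patch_values(ref, u, v, half)
    b = patch_values(match, u - d, v, half)
    # Only the cross term depends on the disparity and is computed here.
    covariance = sum(x * y for x, y in zip(a, b)) / len(a) - mean_r * mean_m
    denom = math.sqrt(var_r * var_m)
    ncc = covariance / denom if denom > eps else 0.0
    return 1.0 - max(0.0, min(1.0, ncc))  # inverted and truncated (assumption)
```

Only the cross term has to be evaluated per disparity, which is why precomputing the patch statistics once per image pays off.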


3.3.1.2 Semi-Global Matching Optimization 


The calculation of the eight different SGM path costs is done sequentially on CUDA hardware.
The parallelization of the cost aggregation on one path direction is realized on two different
levels. First, each CUDA block calculates the costs for 16 different lines along one path
direction (Figure 3.3). If a diagonal line reaches the image border, the values are reset and the
calculation is resumed on the other side of the image. This ensures that all calculations of one
path direction take the same time. Additionally, for each image point, the costs for the
disparities $\Delta u_0, \ldots, (\frac{\Delta u_{\max}}{2} - 1)$ and $\frac{\Delta u_{\max}}{2}, \ldots, \Delta u_{\max}$ are calculated in parallel in two iterations.
This ensures that all threads within one warp access a contiguous area in the memory, allowing
the memory transactions to be more efficient. However, this requires a synchronization
of the threads within a CUDA warp after the costs for all disparities have been calculated, in
order to find the minimum path cost, which is necessary for further processing. (Ruf et al.
2021a, Sec. 2.2.2.)


Figure 3.4: Illustration of the MapReduce method to find the WTA disparity with the minimum aggregated cost.
Each active thread performs a comparison between two elements and stores the smaller element in a
designated memory space. In each iteration, the number of aggregated costs that are being processed
and the corresponding disparities are halved. (Ruf et al. 2021a, Fig. A3.)


To find the minimum of the aggregated costs, the MapReduce (Dean and Ghemawat 2008)
method is utilized. In this, each active thread performs a comparison between two elements
and stores the smaller element in a designated memory space. The number of active threads
as well as the number of elements that are to be processed are halved in each iteration as
illustrated by Figure 3.4. Thus, the search for the minimum cost requires $\log_2(\Delta u_{\max})$
iterations. Ideally, the dimensions of the CUDA blocks are chosen in such a way that there are 32
active threads at the beginning. This allows the MapReduce algorithm to be processed
within a single warp. After the calculation and aggregation of the different SGM path costs, the
WTA disparity with the minimum aggregated costs is to be found. This is done by assigning
a specific image region to each CUDA warp and again utilizing the MapReduce method
mentioned above. (Ruf et al. 2021a, Appendix D.)
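The halving scheme of this reduction can be illustrated by a small sequential Python sketch. The function name `reduce_argmin` and the requirement that the number of costs is a power of two are assumptions of this example; on the GPU, every pass of the inner loop would be executed by one active thread.

```python
def reduce_argmin(costs):
    """Tree-style reduction returning (disparity, cost, iterations)."""
    n = len(costs)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    values = list(costs)
    indices = list(range(n))
    iterations = 0
    active = n // 2
    while active >= 1:
        for t in range(active):  # one 'thread' per comparison
            if values[t + active] < values[t]:
                # Keep the smaller element and remember its disparity.
                values[t] = values[t + active]
                indices[t] = indices[t + active]
        active //= 2             # number of active threads is halved
        iterations += 1
    return indices[0], values[0], iterations
```

For eight costs, the minimum is found after three iterations, in line with the logarithmic number of iterations stated above.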


3.3.1.3 Consistency Check 


The key aspect in the consistency check is the calculation of the approximated disparity map
$U^R$ corresponding to the matching image. This is approximated from the calculated aggregated
cost volume S of the reference image. In this, each entry of $U^R$ is calculated according
to Equation (2.14). The difficulty that arises in this process is the access of non-adjacent
areas in S, as illustrated by Figure 3.5. Between any two entries of the aggregated cost volume S
that are to be used for the calculation of $U^R$, there always lie $\Delta u_{\max} + 1$ entries which are of no
interest. (Ruf et al. 2021a, Sec. 2.2.3.)


In the GPU implementation of the consistency check, each instantiation of a CUDA kernel
computes a specified region in the approximated disparity map $U^R$. Here, the data of the
aggregated cost volume is first copied into shared memory for quicker access. Since the data
does not lie next to each other, the access to the global memory by the different CUDA threads
cannot be pooled together. When all data is available in the shared memory, the MapReduce
method is again used to find the minimum. After the disparity map $U^R$ for the matching image
is approximated, each thread performs the consistency check according to Equation (2.13) for
a specific pixel in the final disparity map U. (Ruf et al. 2021a, Sec. 2.2.3.)
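The following Python sketch illustrates the strided access pattern and the subsequent check. Since Equations (2.13) and (2.14) are not repeated here, their commonly used forms are assumed: the disparity of a matching pixel is taken as the disparity that minimizes S(u + d, v, d), and a reference disparity is accepted if it differs by at most one from the disparity at the corresponding matching pixel. All names and the marker `INVALID` are hypothetical.

```python
INVALID = -1  # marker for rejected pixels (assumption of this sketch)

def cost_index(u, v, d, width, num_disp):
    """Flat index into S; the disparities of one pixel are adjacent."""
    return (v * width + u) * num_disp + d

def approximate_matching_disparity(S, width, height, num_disp):
    """Approximate the disparity map of the matching image from S."""
    right = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            best_d, best_cost = 0, None
            for d in range(num_disp):
                if u + d >= width:
                    break
                # Consecutive accesses are num_disp + 1 entries apart in S.
                cost = S[cost_index(u + d, v, d, width, num_disp)]
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            right[v][u] = best_d
    return right

def consistency_check(left, right, max_diff=1):
    """Invalidate reference disparities that disagree with the matching image."""
    height, width = len(left), len(left[0])
    checked = [[INVALID] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = left[v][u]
            if d == INVALID or u - d < 0:
                continue
            if abs(d - right[v][u - d]) <= max_diff:
                checked[v][u] = d
    return checked
```

The index stride of `num_disp + 1` in the inner loop is what prevents the loads from being pooled into contiguous memory transactions.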


Figure 3.5: In the consistency check, an additional disparity map, which corresponds to the matching image, is
approximated from the aggregated cost volume S (cf. Equation (2.14)). The required entries are not
situated directly next to each other, which hinders an efficient memory access. (Ruf et al. 2021a, Fig. 4.)


3.3.1.4 Median Filter


The execution of the median filter on the GPU is straightforward. Each CUDA thread block
is assigned a region in the final disparity map for which the filter is to be processed. In
this, the necessary data is first copied to the shared memory. Pixels that lie outside the image
are assigned a disparity value of 0xFFFF, making them irrelevant for the filtering. With all the
data in the shared memory, each thread of the thread block calculates the median for a pixel.
Here, the first five iterations of the bubble sort algorithm are performed to sort the values in
the 3 × 3 pixel neighborhood. After the fifth iteration, the five highest values are correctly
sorted and the median can be extracted. (Ruf et al. 2021a, Sec. 2.2.4.)
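A minimal Python sketch of this partial sort is given below; the function name `median_3x3` and the list-based image representation are assumptions for illustration. After five bubble sort passes, the five largest of the nine values occupy their final positions, so the fifth-largest value, i.e. the median, can be read from the middle of the list.

```python
def median_3x3(disparity, u, v, outside=0xFFFF):
    """Median of the 3 x 3 neighborhood of (u, v) via five bubble sort passes."""
    height, width = len(disparity), len(disparity[0])
    values = []
    for dv in (-1, 0, 1):
        for du in (-1, 0, 1):
            y, x = v + dv, u + du
            inside = 0 <= y < height and 0 <= x < width
            # Pixels outside the image get the value 0xFFFF.
            values.append(disparity[y][x] if inside else outside)
    for i in range(5):  # only five passes instead of a full sort
        for j in range(len(values) - 1 - i):
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
    return values[4]    # positions 4..8 now hold the five largest values
```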


3.3.2 Vectorized SIMD Processing with NEON Intrinsic Set on ARM 
CPUs 


While conventional single-instruction-single-data (SISD) processing on the CPU requires one
instruction for each data transfer between the memory and the registers of the CPU, as well
as for each processing of the data stored inside these registers, the additional use of
vector-processors allows one instruction to be performed on a set of data simultaneously, the so-called
single-instruction-multiple-data (SIMD) processing. Each vector-register is divided into multiple
vector-lanes, which in turn hold one datum each. The memory unit as well as the arithmetic
and logical unit of the vector-processor allow multiple data to be transferred simultaneously
into and from the lanes of the vector-registers, as well as the lanes of two registers to be
combined simultaneously, with the results stored in the lanes of a third register. The vector-processors
of the ARMv8 architecture contain 32 vector-registers with a size of 128 bits each (ARM 2017).
The number of vector-lanes inside each register differs, depending on the data type which is
to be stored inside the register. Thus, each register can have 16 lanes when data with a size
of 8 bits each is to be stored, or only two lanes, if each lane holds a datum with the size of
64 bits. In particular, image processing is well-suited for the SIMD parallelization on
vector-processors, since all pixels in an image are processed in the same manner, only with different
data. (Ruf et al. 2021a, Appendix E.)


For an efficient use of vectorized SIMD processing, the developer is compelled to consider 
some design guidelines (ARM 2013): 


• Reduction of the dependencies between conventional CPU and vectorized SIMD
processing, in order to minimize the latency induced by copying data between the
SISD and SIMD pipeline.


• Exploitation of cache coherence, to speed up the data transfer between the memory
and the vector-registers.


• Avoidance of dependencies in the data of vector-instructions, since these trigger
pipeline stalls, in which the SIMD pipeline is stopped until the dependencies are
resolved, slowing down the processing.


• Minimal use of conditional branching, since if the branch prediction unit (BPU) of the
CPU predicts the wrong branch, the pipeline has to be cleared back to the
point of branching and then restarted.


(Ruf et al. 2021a, Appendix E.)


In order to efficiently deploy the real-time processing pipeline for the estimation of dense dis- 
parity maps on an embedded CPU, such as the 8-core ARMv8.2 on the NVIDIA Jetson Xavier 
AGX, two strategies of parallelization are utilized in the scope of this work as illustrated in 
Figure 3.6, namely: 


(i) a thread-level parallelization, and 
(ii) a vectorized data processing with the SIMD NEON intrinsics. 


The implementation uses eight concurrent threads to efficiently utilize the available CPU
cores. In each step of the processing pipeline (cf. Figure 2.2), with the exception of the SGM
optimization step, each thread operates on a different image stripe and thus works in isolation
and independently from the other threads. At two locations in the processing pipeline of the
stereo algorithm, i.e. before and after the SGM optimization, the concurrent threads need
to be synchronized, since the SGM optimization relies on a different thread barrier than the
other steps of the pipeline. In the other steps of the pipeline, the threads do not require any
synchronization and can thus be processed fully concurrently. As part of the second
parallelization strategy, each thread uses the ARM NEON instruction set (ARM 2013) to perform a
vectorized SIMD processing on the vector-processors of the CPU. (Ruf et al. 2021a, Sec. 2.3.)


Figure 3.6: In the optimization of the SGM stereo algorithm for the execution on embedded CPUs, two different
parallelization strategies are utilized. The implementation uses multiple threads to evenly distribute the
processing on the available CPU cores. Each thread uses the ARM NEON instruction set to perform a
vectorized SIMD processing by using the NEON processing units (PU). (Ruf et al. 2021a, Fig. 5.)


Figure 3.7: For the optimized CPU implementation, the census transform is processed for multiple pixels simultane-
ously by utilizing the NEON vector processing units. In this, a sliding window is used, which processes
the image data from left-to-right and from top-to-bottom. (Ruf et al. 2021a, Fig. 6.)


In the following sections, the optimization of each step of the processing pipeline for
execution on an ARMv8 CPU, utilizing thread parallelism and SIMD processing, is discussed.
Since the use of the NCC as a cost function for the image matching is computationally more
expensive, and the frame rates achieved on the CPU are in any case much lower than
those achieved on the GPU, the NCC is not implemented for a vectorized SIMD processing
on the CPU. (Ruf et al. 2021a, Sec. 2.3.)


3.3.2.1 Calculating the Census Transformation and its Hamming Distance 


As illustrated by Figure 3.7, in each thread, the CT is calculated for 16 pixels simultaneously,
using the SIMD vector-registers. In this, the 16 reference pixels with 8 bits each are loaded
into one vector-register. In addition, the corresponding 24 × 16 neighbor pixels are loaded
into 24 further vector-registers, one for each neighbor position, resulting in a total use of 25
vector-registers with 16 lanes each. The full image is processed by sliding the illustrated window
from left-to-right and top-to-bottom over the image. In doing so, there is a good chance that the
image data, which is needed by the next iteration, is already cached. In the calculation of the CT,
the comparison of the reference pixel with itself is omitted. This allows the 24 bits of the
resulting CT bit-string to be represented by three bytes and thus the census descriptors of all 16
pixels to be stored in three vector-registers. This is also why only a support region of 5 × 5 pixels
is considered in the optimized implementation of the CT for the ARM CPU. In order to also
calculate the CT at the image border, where a part of the neighborhood lies outside the image, a
conditional statement would need to be introduced, which is not recommended when using SIMD
vectorization. Thus, the CT is only calculated up to two pixels from the image border, reducing
the image size by four pixels in each dimension for the subsequent processing. (Ruf et al. 2021a,
Sec. 2.3.1)


Figure 3.8: Illustration of the calculation of the Hamming distance with NEON intrinsics. Each census descriptor is
loaded from $I_L$ and $I_R$ in separate vector-registers, on which a XOR operation is applied. The number
of set bits inside a vector-register is counted by using the NEON hardware instruction VCNT (Vector
Count Set Bits). (Ruf et al. 2021a, Fig. 7.)


To calculate the Hamming distance between two census descriptors, namely the one from 
the pixel of the reference image and the corresponding pixel in the matching image, the XOR 
operator is applied followed by the counting of how many bits are set to 1 in the final output. 
To count the number of bits which are set, the NEON instruction set offers a population count 
(VCNT) which can be applied to each vector-lane. As illustrated by Figure 3.8, the resulting 
matching cost, i.e. the Hamming distance of the CT, is then stored into the three-dimensional 
cost volume. By utilizing the SIMD instructions and the 32 vector-registers, a maximum of 64
matching costs can be calculated simultaneously in each thread. Thus, each thread processes
16 disparities and four lines simultaneously in one iteration. (Ruf et al. 2021a, Sec. 2.3.1)
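The lane-wise XOR and population count can be emulated in plain Python as sketched below. This is not NEON code: each list of 16 bytes stands in for one 128-bit vector-register, the function names are hypothetical, and the assumption that the three bytes of a 24-bit census descriptor are kept in three separate registers follows the description of the CT above.

```python
def popcount8(byte):
    """Number of set bits in one 8-bit lane (what VCNT computes per lane)."""
    return bin(byte & 0xFF).count("1")

def hamming_16_lanes(ref_planes, match_planes):
    """Hamming distances of 16 census descriptor pairs.

    ref_planes and match_planes each hold three 'registers' of 16 bytes:
    one register per byte of the 24-bit census descriptor.
    """
    costs = [0] * 16
    for ref_reg, match_reg in zip(ref_planes, match_planes):
        xored = [a ^ b for a, b in zip(ref_reg, match_reg)]  # lane-wise XOR
        counts = [popcount8(x) for x in xored]               # lane-wise bit count
        costs = [c + n for c, n in zip(costs, counts)]       # lane-wise add
    return costs
```

Each of the three list comprehensions corresponds to a single vector instruction on real hardware, so 16 matching costs are obtained with a handful of instructions.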


In case the currently calculated disparity is bigger than the u-coordinate of the reference pixel,
the corresponding matching pixel lies outside the image. In order to efficiently handle this
case, the disparity for which the matching costs are currently being calculated as well as the
u-coordinate of the currently processed pixel are stored in two additional vector-registers.
In each iteration, these two registers are compared against each
other and the result is stored in a third register. If the disparity is greater than the u-coordinate
of the pixel, all bits inside the corresponding vector-lane will be set. Finally, if an OR operation
between the register with the matching cost and the register with the comparison result is applied, the
matching cost of each disparity that spans over the image boundary will be set to 0xFF and thus
will not contribute to the subsequent search for the optimum. (Ruf et al. 2021a, Sec. 2.3.1)
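This branch-free masking can be emulated in Python as follows. The lists stand in for 8-bit vector-lanes and the function name `mask_out_of_image` is an assumption of this sketch; the expression that builds the mask mimics a lane-wise comparison that yields all ones or all zeros.

```python
def mask_out_of_image(costs, u_coords, disparities):
    """Set the cost to 0xFF wherever the disparity exceeds the u-coordinate."""
    masked = []
    for cost, u, d in zip(costs, u_coords, disparities):
        # Comparison result as a lane mask: 0xFF if d > u, otherwise 0x00.
        mask = -(d > u) & 0xFF
        # OR leaves valid costs untouched and saturates invalid ones.
        masked.append(cost | mask)
    return masked
```

No conditional branch depends on the data, which is in line with the design guideline on avoiding branching in SIMD code.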


3.3.2.2 Semi-Global Matching Optimization 


Figure 3.9: Schematic overview of the implementation of the SGM aggregation at a single pixel on an aggregation
path. All components of the SGM path aggregation (cf. Equation (2.11)) are calculated simultaneously.
A final minimum operation will yield the result which is stored in the aggregated cost volume. (Ruf
et al. 2021a, Fig. 8.)


Figure 3.10: Illustration of different traversal strategies in the aggregation of the SGM path costs within the opti-
mized implementation using NEON intrinsics. (Ruf et al. 2021a, Fig. 9.)


The optimized implementation of the SGM algorithm can be divided into two separate steps.
First, each thread calculates the SGM path costs for each path that is assigned to it according
to Equation (2.11). As illustrated by Figure 3.9, four vector-registers are first filled
with the results of the previous iteration, so that the vector-register $L_r(p - r, \Delta u)$ holds
the previous path costs at the same disparity level, the vector-registers $L_r(p - r, \Delta u - 1)$ and
$L_r(p - r, \Delta u + 1)$ hold the previous path costs at the disparity levels ±1, and the
vector-register $\min_{\Delta u} L_r(p - r, \Delta u)$ holds the minimum path costs over all disparity levels at the
considered pixel. According to Equation (2.11), the current matching costs from the cost volume
as well as the penalties are added to the different vector-registers. Again, the NEON
instruction set provides a method to get the minimum from the four vector-registers. The result
is stored in the allocated memory and is compared to the path costs of the other disparities
in order to get the minimum path cost for the next iteration. (Ruf et al. 2021a, Sec. 2.3.2)
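The update of the path costs at one pixel can be sketched in Python as shown below. Since Equation (2.11) is not repeated here, the widely used SGM recurrence is assumed, including the subtraction of the previous minimum that keeps the values bounded; the function name `aggregate_path` and the penalty names `p1` and `p2` are chosen for this example.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one path.

    costs: list over the pixels on the path, each a list over disparities.
    p1, p2: penalties for disparity changes of one and of more than one level.
    """
    num_disp = len(costs[0])
    previous = list(costs[0])  # first pixel on the path has no predecessor
    aggregated = [previous]
    for pixel_costs in costs[1:]:
        prev_min = min(previous)  # minimum over all disparity levels
        current = []
        for d in range(num_disp):
            same = previous[d]
            minus = previous[d - 1] + p1 if d > 0 else float("inf")
            plus = previous[d + 1] + p1 if d + 1 < num_disp else float("inf")
            jump = prev_min + p2
            # The four candidates correspond to the four vector-registers.
            best = min(same, minus, plus, jump)
            current.append(pixel_costs[d] + best - prev_min)
        aggregated.append(current)
        previous = current
    return aggregated
```

In a vectorized version, the loop over `d` disappears, since each of the four candidates is held in one register with one lane per disparity.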


In each thread, the costs for the disparities of two pixels are calculated simultaneously. In this,
the implementation of the horizontal and vertical paths ($L_{LR}$, $L_{RL}$, $L_{TB}$, $L_{BT}$) differs from the
implementation of the diagonal paths ($L_{TLBR}$, $L_{TRBL}$, $L_{BLTR}$, $L_{BRTL}$). On the straight paths, each
thread processes two pixels on two neighboring rows or columns (Figure 3.10a). However,
due to the different lengths of the diagonal paths, this cannot be applied to their processing.
Instead, on the diagonal paths, each thread processes two pixels lying on opposite
sides of the image. This is illustrated by Figure 3.10b. (Ruf et al. 2021a, Sec. 2.3.2)


In the second step of the SGM aggregation, each thread sums up all SGM path costs and finds
the disparity with the minimum cost, i.e. the WTA cost, for each image stripe assigned to it.
The different path costs are first copied into different vector-registers and then summed up.
While summing up, each thread additionally stores the currently processed disparity as well
as the minimum aggregated cost and the corresponding WTA disparity in additional
vector-registers. Afterwards, the aggregated costs are compared to the minimum costs which are
updated if necessary. If the minimum costs are updated, the corresponding WTA disparity is
updated accordingly. The final aggregated path costs are stored into an aggregated cost volume
S, which is needed for the subsequent consistency check, while the final WTA disparity
is written into the disparity map. (Ruf et al. 2021a, Sec. 2.3.2)
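For a single pixel, the summation with a running minimum can be sketched in Python as follows; the function name `sum_paths_and_wta` and the list-based representation are assumptions for illustration only.

```python
def sum_paths_and_wta(path_costs):
    """Sum the path costs of one pixel and find the WTA disparity.

    path_costs: list over paths, each a list over disparities.
    Returns the aggregated costs and the disparity with the minimum cost.
    """
    num_disp = len(path_costs[0])
    aggregated = []
    best_cost, best_disparity = None, 0
    for d in range(num_disp):
        total = sum(path[d] for path in path_costs)
        aggregated.append(total)  # kept for the later consistency check
        # Update the running minimum and the WTA disparity together.
        if best_cost is None or total < best_cost:
            best_cost, best_disparity = total, d
    return aggregated, best_disparity
```

The aggregated costs are returned alongside the WTA disparity because the consistency check needs the full aggregated cost volume.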


3.3.2.3 Consistency Check 


In the optimization of the consistency check for the CPU, the same difficulty as in the
implementation of the approximated consistency check for the GPU is encountered, namely that
the required data from the aggregated cost volume does not lie physically next to each other
(cf. Figure 3.5). This makes it inefficient to load the data into vector-registers first and then
process it with NEON intrinsics. There are two ways to solve this problem: The first possibility,
which was proposed by Spangenberg et al. (2014), is to first transform the aggregated cost
volume S into a temporary volume in such a way that the data which is required to compute
the approximated disparity map $U^R$ will physically lie next to each other in memory. Afterwards,
the vector-registers can be filled with the data and the approximated disparity map $U^R$
corresponding to the matching image can be calculated efficiently with SIMD instructions.
The second option, which is realized in the scope of this work, is to refrain from using SIMD
instructions for the consistency check. Thus, in the implementation of the consistency check,
only a thread-level parallelization, in which each thread is processing a different part of the
cost volume, is used. This avoids the need to rearrange the cost volume, which would be
necessary to use SIMD instructions. (Ruf et al. 2021a, Sec. 2.3.3)


3.3.2.4 Median Filter Based on Sorting Networks 


After the consistency check, a final 3 x 3 median filter is employed in order to remove any small
outliers that still remain in the disparity map U. This requires a sorting of all disparity values
within the local neighborhood. In general, sorting algorithms like BubbleSort, MergeSort
or QuickSort are able to sort an arbitrary number of input values and, in turn, require a
number of comparison and swap operations that is not known in advance, making a naive
implementation inappropriate for vectorized processing. In the case of the fixed-size median
filter, however, the number of considered input values is fixed and known in advance. This
allows the use of sorting networks (Knuth 1998), which sort an input vector of known size with
a fixed number of comparators and swapping operations.


Figure 3.11: Exemplary sorting network, implementing the BubbleSort algorithm for nine input values. The
horizontal lines represent the wires, while the comparators are illustrated by the vertical connections. The
values of the input vector move along the wires from left to right and get rearranged by the comparators,
assigning the smaller value to the upper wire. This results in a sorted output vector on the right
with the smallest value at the top. (Ruf et al. 2021a, Fig. A4.)


A sorting network is comprised of two basic building blocks, namely:


(i) wires, which hold and transport one value of the input vector each, and 


(ii) comparators, which compare the values of the connected wires and swap them if
necessary.


The comparators always connect two wires with each other and assign the smaller value to 
the upper wire. Figure 3.11 illustrates the BubbleSort algorithm implemented using a sorting 
network. In this, the horizontal lines represent the wires, while the comparators are illus- 
trated by the vertical connections. The values of the input vector move along the wires from 
left to right and get rearranged by the comparators, resulting in a sorted output vector on the 
right with the smallest value at the top. (Ruf et al. 2021a, Appendix F.) 


In ReS²tAC, the concept of sorting networks is utilized in order to allow for a parallel sorting
with SIMD intrinsics and, in turn, for the vectorized processing of the 3 x 3 median
filter. In this, each wire of the sorting network is realized by one vector-lane, and the wires
of one network are distributed among different vector-registers, utilizing the same vector-lane
of each register. Thus, for the implementation of one sorting network for the median filter with
nine wires, nine vector-lanes distributed over nine different vector-registers are utilized
(Figure 3.12). (Ruf et al. 2021a, Sec. 2.3.4)


Figure 3.12: Illustration of the implementation of sorting networks using vectorized SIMD processing. The nine
wires of a sorting network are mapped to vector-lanes of nine different vector-registers. (Ruf et al.
2021a, Fig. 10.)


The comparators of the sorting network are implemented by using two comparison instructions
of the NEON instruction set, which compare each vector-lane of two vector-registers and
store the minimum or maximum in a third register. Thus, for each vector-lane, the minimum
and maximum value are first extracted from the two source vector-registers. Then, the minimum
is copied to the register which represents the upper wire of the sorting network, while
the maximum is copied to the register which represents the lower wire. By this vectorized
parallelization, the median filter is computed for 16 pixels simultaneously. Inherent to the
nature of the BubbleSort algorithm, it is only necessary to calculate the first five iterations to
get the median of a support region with a size of 3 x 3 pixels: after five passes, the five largest
values have reached their final positions, so that the fifth wire holds the median.
(Ruf et al. 2021a, Sec. 2.3.4)
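The mapping of wires to registers can be illustrated with the following NumPy sketch, in which whole arrays stand in for the vector-registers and `np.minimum`/`np.maximum` stand in for the two NEON instructions of a comparator. This is an illustrative assumption, not the NEON code of ReS²tAC; the function name and the choice to leave border pixels unchanged are hypothetical.

```python
import numpy as np

def median3x3_sorting_network(disp):
    """3 x 3 median filter via a partial BubbleSort sorting network.

    Each of the nine "wires" is an array holding one neighbor of every
    interior pixel, so one comparator processes all pixels at once.
    Border pixels are copied unchanged (an assumption of this sketch).
    """
    h, w = disp.shape
    # Nine shifted views of the image: one wire per neighborhood position.
    wires = [disp[dy:h - 2 + dy, dx:w - 2 + dx]
             for dy in range(3) for dx in range(3)]
    n = len(wires)
    # Only the first five BubbleSort passes are needed: each pass moves the
    # largest remaining value to the lowest unsorted wire, so after five
    # passes wire 4 (the fifth wire) holds the median.
    for p in range(5):
        for i in range(n - 1 - p):
            smaller = np.minimum(wires[i], wires[i + 1])  # goes to upper wire
            larger = np.maximum(wires[i], wires[i + 1])   # goes to lower wire
            wires[i], wires[i + 1] = smaller, larger
    out = disp.copy()
    out[1:-1, 1:-1] = wires[4]
    return out
```

The sequence of comparators is independent of the data, which is what allows the same instruction stream to be applied to all lanes in parallel.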


3.4 Experiments 


In the scope of this work, ReS²tAC is quantitatively evaluated on two public stereo benchmarks
with respect to accuracy, run-time and power consumption. In this, the results are
also compared to those achieved by related approaches from the literature. For a qualitative
demonstration of the usability of ReS²tAC for real-time processing on board a UAV, it is
deployed on a DJI Manifold 2-G, which, in turn, is mounted on a DJI Matrice 210v2 RTK. While
the quantitative evaluation is presented in Section 3.4.1, qualitative results of the execution
of ReS²tAC on board a UAV are presented in Section 3.4.2.


3.4.1 Quantitative Evaluation of Accuracy on Public Stereo 
Benchmarks 


The quantitative assessment of the performance of ReS²tAC and its different configurations
for embedded stereo processing comprises an evaluation with respect to their accuracy in
Section 3.4.1.2, as well as studies on the effects of the subpixel disparity refinement in
Section 3.4.1.3 and the improvements gained by an accurate left-right consistency check in
Section 3.4.1.4. Moreover, in Section 3.4.1.5, a study on the processing speed and power
consumption of ReS²tAC, together with the effects of reducing the aggregation paths of the SGM
optimization, is conducted. In the scope of the quantitative evaluation, ReS²tAC was deployed
on the NVIDIA Jetson Xavier AGX with an 8-core 64-bit ARMv8.2 CPU and a 512-core Volta
GPU. All measurements with respect to accuracy, timings and power consumption were done
on this hardware. (Ruf et al. 2021a, Sec. 3.1.)


3.4.1.1 Evaluation Datasets and Error Measures 


For the evaluation of the accuracy of ReS²tAC and its different configurations, the training set
of the KITTI 2015 stereo benchmark (Menze and Geiger 2015), which consists of 200 stereo
image pairs and ground truth disparity maps captured by a LiDAR sensor from on top of a
car driving around urban areas, as well as the Middlebury 2014 stereo benchmark (Scharstein
et al. 2014) are used. The latter allows a more thorough evaluation of the accuracy and
the effects of different optimizations in the processing pipeline, since it consists of 15
high-resolution stereo pairs of indoor scenes, together with highly accurate and dense ground truth
disparity maps computed by means of structured lighting. (Ruf et al. 2021a, Sec. 3.1.)


The standard evaluation routine of the KITTI 2015 stereo benchmark (Menze and Geiger 2015)
states the accuracy as the amount of erroneous pixels (D1-all), averaged over all m ground
truth pixels in the evaluation set, for which the estimated disparity $\Delta u_{\mathrm{est}}$ differs by 3 or more
pixels with respect to the ground truth $\Delta u_{\mathrm{gt}}$:


$$\text{D1-all}(\Delta u_{\mathrm{est}}, \Delta u_{\mathrm{gt}}) = \frac{1}{m} \sum_{i=1}^{m} \left[\, \left| \Delta u_{\mathrm{est},i} - \Delta u_{\mathrm{gt},i} \right| \geq 3 \,\right], \tag{3.1}$$


with [·] representing the Iverson bracket. Since the ground truth was generated from a LiDAR
sensor mounted at a slightly different position than the camera for which the disparity
map is estimated, the ground truth also provides disparity values in areas which are occluded
in the second camera image and for which, in turn, the estimated disparity map usually
contains only limited information. Although the KITTI benchmark also provides ground truth maps
which only contain non-occluded (noc) areas, the standard evaluation protocol uses the
occluded (occ) dataset, which has also been used for the evaluation scores in Table 3.2.
Furthermore, the benchmark distinguishes between the results of the actual estimated (Est) disparity
maps and interpolated versions of them (All). The latter allow a comparison between
disparity maps of different density by applying a background interpolation to fill the pixels
in the estimated disparity map for which no data is available. However, since ReS²tAC uses
a left-right consistency check and a median filter to explicitly remove outliers and inconsistent
areas, the results achieved by the actual estimate are of greater interest. Nonetheless,
for comparison, the results achieved by the interpolated disparity maps are also provided, as
well as the density of the non-interpolated map, if available, which states
the percentage of pixels in the estimated disparity map which contain valid estimates. (Ruf et al.
2021a, Sec. 3.1.)


Similar to the evaluation routine of the KITTI 2015 stereo benchmark, the Middlebury benchmark
ranks the algorithms based on four different accuracy levels, namely the amount of
pixels whose error is greater than 0.5 (bad0.5), 1 (bad1), 2 (bad2) and 4 (bad4) pixels with
respect to all m ground truth pixels in the evaluation set:


$$\text{bad}\delta(\Delta u_{\mathrm{est}}, \Delta u_{\mathrm{gt}}) = \frac{1}{m} \sum_{i=1}^{m} \left[\, \left| \Delta u_{\mathrm{est},i} - \Delta u_{\mathrm{gt},i} \right| > \delta \,\right], \tag{3.2}$$


with $\Delta u_{\mathrm{est}}$ and $\Delta u_{\mathrm{gt}}$ again denoting the estimated and ground truth disparity respectively,
$\delta$ denoting the error threshold, and [·] representing the Iverson bracket. The data is provided
in full (F) image resolution with up to 3000 x 2000 pixels and a disparity range of up to 800 pixels
as well as in half (H) and quarter (Q) image resolution. The official evaluation is always performed
on the full image resolution. Thus, if the results are generated on a dataset with a smaller
resolution, they are first upsampled before being evaluated. (Ruf et al. 2021a, Sec. 3.1)
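Both error measures count the share of ground truth pixels whose disparity error exceeds a threshold. The following NumPy sketch shows this computation together with the density. It is a simplified illustration and not the official evaluation code of either benchmark: the function names, the convention that ground truth pixels without data are 0, and the marker -1 for pixels without an estimate are assumptions.

```python
import numpy as np

def error_rate(est, gt, threshold, inclusive=False, invalid=-1):
    """Percentage of evaluated pixels whose disparity error exceeds a threshold.

    Only pixels with ground truth (gt > 0) and a valid estimate are evaluated,
    which corresponds to scoring the actual estimate rather than an
    interpolated map. Use inclusive=True for "threshold or more" as in the
    D1-all definition, and inclusive=False for "greater than" as in bad-delta.
    """
    mask = (gt > 0) & (est != invalid)
    err = np.abs(est[mask].astype(np.float64) - gt[mask])
    bad = (err >= threshold) if inclusive else (err > threshold)
    return 100.0 * np.count_nonzero(bad) / max(mask.sum(), 1)

def density(est, invalid=-1):
    """Percentage of pixels in the disparity map that hold a valid estimate."""
    return 100.0 * np.count_nonzero(est != invalid) / est.size
```

For example, `error_rate(est, gt, 3, inclusive=True)` follows the form of Equation (3.1), while `error_rate(est, gt, 0.5)` follows the form of Equation (3.2) for bad0.5.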


3.4.1.2 Accuracy 


In this section, the accuracy achieved on the KITTI 2015 stereo benchmark is listed and dis- 
cussed first. This is followed by the evaluation of the accuracy achieved on the Middlebury 
2014 stereo benchmark. 


KITTI 2015 Stereo Benchmark: 


Table 3.2 lists the quantitative results of the accuracy of different configurations of ReS²tAC
as well as those of other approaches and implementations, which are achieved on the KITTI
2015 stereo benchmark. While the results of ReS²tAC were achieved on the training set of the
benchmark, the results of the approaches from literature were taken either from the corresponding
publication or the official listing of the benchmark, which lists the results achieved
on the actual test set. The upper part of Table 3.2 lists algorithms and configurations
which are optimized for the deployment and execution on embedded hardware. Not all of
these are variants of the Semi-Global Matching stereo algorithm. Yet, they serve as a good
comparison since they were deployed on the same or similar hardware as the one presented
in this work. The three algorithms at the bottom of the list serve as a baseline to ReS²tAC.
The entry of Hirschmüller (2008) reveals the accuracy achieved by the original SGM
algorithm executed on a GPU with the census transform of unknown size as a cost function.
The Semi-Global Block Matching (SGBM) variant of the OpenCV library is widely used and
easy to use, but not optimized for embedded processing. The algorithm of Schönberger
et al. (2018) is evaluated on both the KITTI 2015 stereo benchmark and the Middlebury 2014
stereo benchmark. They propose to use a random forest classifier to learn to efficiently fuse
the different scan-line optimizations of the SGM algorithm, in order to reduce the number of
optimization paths for embedded processing more efficiently. (Ruf et al. 2021a, Sec. 3.1.1.)


The accuracy of ReS²tAC is evaluated using different cost functions and different support
regions. For each configuration, the results achieved on the original image size provided
by the KITTI benchmark, i.e. 1242 x 375 pixels, as well as on images with VGA resolution
were evaluated. In the case of VGA resolution, the original images were first downsampled
to a resolution of 640 x 480 pixels. Then, the stereo disparity estimation was performed
and finally the resulting disparity maps were upsampled again to the original image size
with a nearest-neighbor interpolation and a scaling of the disparities by the horizontal scale
factor. Thus, for the calculation of the accuracy measurements, the original image size was
used. For the selection of the SGM penalties, different values were empirically evaluated.
Those achieving the best results were then chosen for the subsequent experiments. These are
φ1 = 27 and φ2 = 86 for the CT9x7, and φ1 = 90 and φ2 = 880 for the NCC5x5. An excerpt
of the qualitative results for the best configuration of ReS²tAC is presented in Figure 3.13.
(Ruf et al. 2021a, Sec. 3.1.1.)


Table 3.2: Accuracy achieved by different algorithms and configurations of ReS²tAC on the KITTI 2015 stereo
benchmark (Menze and Geiger 2015). While the upper part lists algorithms that are optimized and
deployed on embedded hardware, the three algorithms at the bottom are listed as a reference and a baseline.
The results achieved by different configurations of ReS²tAC are listed in the middle section. The accuracy
is stated as the amount of erroneous pixels (D1-all), for which the estimated disparity differs by 3 or more
pixels with respect to the ground truth. The KITTI 2015 benchmark distinguishes between the result of
the actual estimated (Est) disparity map and an interpolated version of it (All), in which the pixels for
which no disparity is available get interpolated by a simple background interpolation. As an evaluation
ground truth, all available pixels were considered, not only the non-occluded ones. The density
indicates how many pixels inside the computed disparity maps have an estimate. †: The accuracy stated is
computed with respect to the non-occluded pixels in the ground truth. (Ruf et al. 2021a, Tab. 2.)


Approach                 Configuration     HW       Resolution     D1-all (in %)    Density
                                           Device   (in pixels)    Est.     All     (in %)
Zhao et al. (2020)       CT5x5-SGM         FPGA     1242 x 375     -        11.8    -
Zhao et al. (2020)       CT7x7-SGM         FPGA     1242 x 375     -        9.5     -
Ruf et al. (2018a)       CT5x5-SGM         FPGA     640 x 360      4.6      31.2    46.4
Rahn. et al. (2018)      CT5x5-MGM         FPGA     1242 x 375     6.7      13.6    81.0
Rahn. et al. (2018)      CT13x13-MGM       FPGA     1242 x 375     4.8      9.9     85.0
Cui et al. (2019)†       NCC9x9-native     GPU      1242 x 375     -        16.6    -
Cui et al. (2019)†       NCC9x9-optim.     GPU      1242 x 375     -        13.1    -
Chang et al. (2020)      Z²-ZNCC           GPU      1242 x 375     7.6      7.7     99.9
Hern.-J. et al. (2016)   CT9x7-SGM         GPU      1242 x 375     8.2      8.2     100

ReS²tAC-CUDA             CT5x5-SGM         GPU      640 x 480      5.4      8.4     94.5
ReS²tAC-CUDA             CT5x5-SGM         GPU      1242 x 375     4.3      8.3     88.8
ReS²tAC-CUDA             CT9x7-SGM         GPU      640 x 480      5.1      7.9     94.6
ReS²tAC-CUDA             CT9x7-SGM         GPU      1242 x 375     4.0      7.7     90.0
ReS²tAC-CUDA             NCC5x5-SGM        GPU      640 x 480      5.3      7.8     94.8
ReS²tAC-CUDA             NCC5x5-SGM        GPU      1242 x 375     4.3      8.1     90.0
ReS²tAC-CUDA             NCC9x9-SGM        GPU      640 x 480      5.9      8.2     94.7
ReS²tAC-CUDA             NCC9x9-SGM        GPU      1242 x 375     4.8      8.3     91.1
ReS²tAC-NEON             CT5x5-SGM         CPU      640 x 480      5.0      7.9     94.5
ReS²tAC-NEON             CT5x5-SGM         CPU      1242 x 375     4.6      8.5     90.0

Schönb. et al. (2018)    NCC7x7-SGM-F.     CPU      1242 x 375     4.3      4.4     99.9
OpenCV-SGBM              SAD3x3-SGM        CPU      1242 x 375     5.9      10.9    90.4
Hirschmüller (2008)      CT-SGM            GPU      1242 x 375     6.4      6.4     100
Figure 3.13: Four exemplary results from the KITTI 2015 stereo benchmark, computed with the CT9x7-SGM
configuration on the original image size. Rows 1 & 4: Reference images. Rows 2 & 5: Estimated
disparity maps. Rows 3 & 6: Color-coded error images between the prediction and the ground truth.
Error images use a log-color scale as described by Menze and Geiger (2015), marking correct estimates
in blue color tones and wrong estimates in red color tones. (Ruf et al. 2021a, Fig. 11.)


The results in Table 3.2 reveal that the use of the census transform with a support region
of 9 x 7 pixels and its Hamming distance as a matching cost function achieves the best
results in the evaluation of both the actual estimate and the interpolated version. As can be
expected, the use of downsampled versions of the input images yields less accurate results,
yet achieves a higher density in the resulting disparity maps. Furthermore, the use of the
normalized cross correlation as a cost function achieves slightly less accurate results, while
leading to a lower throughput as discussed in Section 3.4.1.5. The implementation on the
CPU using NEON SIMD intrinsics with a CT of size 5 x 5 pixels achieves similar and, in case
of the smaller image resolution, slightly better results than its GPU counterpart. In summary,
when considering the actual estimate, the configuration CT9x7-SGM executed on the GPU,
with the original image size of 1242 x 375 pixels, outperforms the baseline implementations as
well as the other algorithms optimized for embedded hardware. When evaluating the interpolated
disparity maps, the configurations of ReS²tAC achieve similar and mostly better results than the other
embedded algorithms. The superiority of the baseline algorithms from Hirschmüller (2008)
and Schönberger et al. (2018) with respect to the quality of the disparity estimation is to be
expected, since they were optimized with respect to accuracy and not speed or throughput.
(Ruf et al. 2021a, Sec. 3.1.1.)


Middlebury 2014 Stereo Benchmark: 


The results achieved by ReS²tAC on the training set of the Middlebury 2014 stereo benchmark
are listed in Table 3.3, with a qualitative excerpt of the results achieved by the best
configuration presented in Figure 3.14. None of the other approaches for stereo processing
on embedded hardware, which were listed in the evaluation on the KITTI benchmark, have
also been evaluated on the Middlebury 2014 stereo benchmark and, thus, these are not listed
in this evaluation. However, results on the non-embedded baseline algorithms are available
and are again listed in the lower part of Table 3.3. In this evaluation, the same configurations
as those evaluated on the KITTI benchmark were studied. Since ReS²tAC can only handle a
disparity range of up to 256 pixels, the disparity maps were computed on the provided quarter
image resolution (Orig. Q). Just as in the standard evaluation routine of the benchmark, the
results listed were obtained after upsampling the disparity maps to the original image resolution
with a nearest-neighbor interpolation and scaling the contained disparities by a factor of
4. Again, for the selection of the SGM penalties, different values were empirically evaluated
and selected according to the results achieved, being φ1 = 11 and φ2 = 39 for the CT5x5, and
φ1 = 140 and φ2 = 730 for the NCC5x5. (Ruf et al. 2021a, Sec. 3.1.1.)


Table 3.3: Accuracy achieved by different algorithms and configurations of ReS²tAC on the Middlebury 2014 stereo
benchmark (Scharstein et al. 2014). While the upper part lists the results of different configurations of
ReS²tAC, the three algorithms at the bottom are listed as a reference and a baseline. The accuracy is
stated as the amount of erroneous pixels, whose error is greater than 0.5 (bad0.5), 1 (bad1), 2 (bad2)
and 4 (bad4) pixels with respect to the ground truth. The density indicates how many pixels inside
the computed disparity maps have an estimate. The Middlebury 2014 stereo benchmark provides the
image data in full (F), half (H) and quarter (Q) image resolution. The results of ReS²tAC were computed
on the quarter image resolution and evaluated according to the standard evaluation pipeline on the full
resolution. (Ruf et al. 2021a, Tab. 3.)


Approach                Configuration     Resolution   Error (in %)                     Density
                                                       bad0.5   bad1    bad2    bad4    (in %)
ReS²tAC-CUDA            CT5x5-SGM         Orig. Q      77.0     59.5    35.4    13.5    93.0
ReS²tAC-CUDA            CT9x7-SGM         Orig. Q      77.3     59.9    35.7    13.6    93.3
ReS²tAC-CUDA            NCC5x5-SGM        Orig. Q      77.1     60.1    36.3    14.4    92.8
ReS²tAC-CUDA            NCC9x9-SGM        Orig. Q      77.1     61.2    38.7    17.1    91.9
ReS²tAC-NEON            CT5x5-SGM         Orig. Q      76.2     59.0    35.1    13.4    92.1

Schönb. et al. (2018)   NCC7x7-SGM-F.     Orig. H      43.1     14.8    7.0     3.7     100
OpenCV-SGBM             SAD3x3-SGM        Orig. Q      67.3     42.1    25.5    17.3    100
Hirschmüller (2008)     CT-SGM            Orig. H      51.5     28.2    17.7    12.2    100


Figure 3.14: Three exemplary results from the Middlebury 2014 stereo benchmark, computed with the
best-performing base configuration, i.e. CT5x5-SGM on the quarter image resolution. Row 1: Reference
images. Row 2: Estimated disparity maps. Row 3: Ground truth disparity maps. (Ruf et al. 2021a,
Fig. 12.)


The results in Table 3.3 are not satisfactory. This, however, is not surprising, since only
the quarter image resolution was used to compute the results, which were then upsampled
by a factor of 4 for evaluation, introducing many errors due to interpolation. As stated
by Scharstein et al. (2014), the aim of this benchmark is to provide new challenges for modern
stereo algorithms in terms of image resolution, accuracy and scene complexity, and not
necessarily to evaluate optimizations with respect to computational efficiency and
run-time. Nonetheless, the accuracy of the ground truth and the evaluation protocol of the
Middlebury 2014 stereo benchmark allow for an evaluation of the improvement gained by a
subpixel disparity refinement, as done in the following section. (Ruf et al. 2021a, Sec. 3.1.1.)
Table 3.4: Accuracy achieved by selected configurations with an additional subpixel disparity refinement on the
KITTI 2015 stereo benchmark (Menze and Geiger 2015). The corresponding differences to the accuracies
listed in Table 3.2 are given in parentheses. (Ruf et al. 2021a, Tab. 4.)


Approach        Configuration      Resolution     D1-all (in %)
                                   (in pixels)    Est.          All
ReS²tAC-CUDA    CT5x5-SGM-fine     1242 x 375     3.5 (-0.8)    8.0 (-0.4)
ReS²tAC-CUDA    CT9x7-SGM-fine     1242 x 375     3.3 (-0.7)    7.4 (-0.3)
ReS²tAC-CUDA    NCC5x5-SGM-fine    1242 x 375     3.7 (-0.6)    7.7 (-0.4)
ReS²tAC-CUDA    NCC9x9-SGM-fine    1242 x 375     4.3 (-0.5)    7.9 (-0.4)


Table 3.5: Accuracy achieved by selected configurations with an additional subpixel disparity refinement on the
Middlebury 2014 stereo benchmark (Scharstein et al. 2014). The corresponding differences to the
accuracies listed in Table 3.3 are given in parentheses. (Ruf et al. 2021a, Tab. 5.)


Approach        Configuration      Res.      Error (in %)
                                             bad0.5        bad1          bad2          bad4
ReS²tAC-CUDA    CT5x5-SGM-fine     Orig. Q   72.4 (-4.6)   52.1 (-7.4)   26.1 (-9.3)   10.4 (-3.1)
ReS²tAC-CUDA    CT9x7-SGM-fine     Orig. Q   73.0 (-4.3)   52.7 (-7.2)   26.6 (-9.1)   10.7 (-2.9)
ReS²tAC-CUDA    NCC5x5-SGM-fine    Orig. Q   73.8 (-3.3)   53.8 (-6.3)   27.8 (-8.5)   12.1 (-2.0)
ReS²tAC-CUDA    NCC9x9-SGM-fine    Orig. Q   73.8 (-3.3)   55.2 (-6.0)   30.6 (-8.1)   14.8 (-2.3)


3.4.1.3 The Effect of Subpixel Disparity Refinement 


As described in Section 2.3.1, a subpixel disparity refinement can be computed for each pixel
in the disparity map by fitting a parabola through the matching costs of the winning disparity
and its two neighbors. This reduces the error rate by up to 0.8 percentage points in case of the
KITTI benchmark (Table 3.4), and by up to 9.3 percentage points in case of the Middlebury
benchmark (Table 3.5), and yet only requires a small computational overhead (cf. Table 3.6).
(Ruf et al. 2021a, Sec. 3.1.2.)
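The parabola fit has a closed-form solution, which the following sketch illustrates. It uses the standard three-point parabola interpolation and is not taken from the ReS²tAC source; the function name and the fallback for a degenerate fit are assumptions.

```python
def refine_subpixel(cost_prev, cost_best, cost_next, disparity):
    """Refine an integer WTA disparity by a parabola through three costs.

    cost_prev, cost_best and cost_next are the costs at disparity - 1,
    disparity and disparity + 1. The vertex of the parabola through these
    three points gives the subpixel offset, which lies within [-0.5, 0.5]
    whenever cost_best is the smallest of the three costs.
    """
    denom = cost_prev - 2.0 * cost_best + cost_next
    if denom <= 0.0:
        # Flat or non-convex cost profile: keep the integer disparity.
        return float(disparity)
    return disparity + 0.5 * (cost_prev - cost_next) / denom
```

Since only three cost values and a handful of arithmetic operations are needed per pixel, the small computational overhead reported for the refinement is plausible.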


3.4.1.4 Accurate Left-Right Consistency Check 


To evaluate the effects of only approximating the disparity map corresponding to the right
image of the stereo pair, which is needed for the left-right consistency check (cf. Section 2.3.3), a
more exact and computationally more expensive consistency check was also implemented for
the GPU. In this, the reference and matching images are switched and flipped and a second
disparity map for the original matching image is calculated. This leads to a more accurate
disparity map U^R for the right input image, which, in turn, is used by Equation (2.13) of the
consistency check. With a more accurate U^R, it is assumed that the consistency check is more
effective in filtering outliers. But does the increase in accuracy justify the high computational
overhead of fully calculating two disparity maps? The results of the evaluation do not
indicate a significant improvement. The studies reveal an increase in accuracy of only 0.4-1.0 %
when calculating the disparity map of the right image from scratch compared to only
approximating it from the cost volume corresponding to the left disparity map. However, the
throughput is nearly halved when using a more accurate consistency check, as illustrated by
configurations with the suffix "exact-cc" (exact consistency check) in Table 3.6. (Ruf et al.
2021a, Sec. 3.1.3.)


3.4.1.5 Throughput, Frame Rates and Power Consumption 


A typical measure to quantify the processing speed of a stereo algorithm is the number of
frames per second (FPS) which can be calculated. However, since the FPS greatly depends on
the image size of the output and the disparity range, the efficiency of ReS²tAC is assessed
based on the throughput achieved, which is measured in million disparity estimations per
second (MDE/s):


$$\text{MDE/s} = \frac{w \cdot h \cdot |\Gamma_u|}{\text{run-time}}, \tag{3.3}$$


with w and h being the width and the height of the resulting disparity map, and $|\Gamma_u|$ being
the size of the disparity range. Given the throughput achieved by a certain configuration
or algorithm, it is possible to deduce the expected frame rate for a given image size and
disparity range:


$$\text{FPS} = \frac{\text{MDE/s}}{w \cdot h \cdot |\Gamma_u|}, \tag{3.4}$$


as done in Figure 3.15. Furthermore, the throughput allows for a better comparison between
different algorithms with respect to their processing speed, due to its independence of a fixed
image size and disparity range. (Ruf et al. 2021a, Sec. 3.1.4.)
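The relation between throughput and frame rate can be sketched in a few lines. The function names are hypothetical, the scaling by 10^6 is made explicit here on the assumption that the throughput is reported in millions, and the disparity range of 128 pixels in the usage note is an illustrative value rather than a setting stated in this section.

```python
def throughput_mde_per_s(width, height, disparity_range, runtime_s):
    """Throughput in million disparity estimations per second (cf. Eq. 3.3)."""
    return width * height * disparity_range / runtime_s / 1e6

def expected_fps(mde_per_s, width, height, disparity_range):
    """Expected frame rate for a given image size and disparity range (cf. Eq. 3.4)."""
    return mde_per_s * 1e6 / (width * height * disparity_range)
```

For instance, a VGA image with a disparity range of 128 pixels processed in 50 ms corresponds to a throughput of about 786 MDE/s, and inserting this throughput into `expected_fps` with the same image size and range recovers the frame rate of 20 FPS.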


In Table 3.6, the throughput achieved by different configurations of ReS²tAC as well as the
throughput of selected embedded algorithms from literature is listed. While all measurements
in this work were done on the NVIDIA Jetson Xavier AGX with maximum performance,
the related work optimized for the execution on GPU was deployed on the NVIDIA Jetson
TX1 and TX2. For related work that does not explicitly state the throughput
of the respective algorithm, the values were calculated from the stated frame rate and
the corresponding image size according to Equation (3.4). For the run-time measurements,
the whole processing pipeline was considered, including the upload of the stereo image pair
and the download of the disparity image to and from the device memory of the GPU. (Ruf
et al. 2021a, Sec. 3.1.4.)


Table 3.6: Throughput achieved by ReS²tAC and selected embedded algorithms from literature. The throughput is
measured in million disparity estimations per second (MDE/s). In case of ReS²tAC, all the measurements
were done on the NVIDIA Jetson Xavier AGX board, with the power setting set to maximum performance.
(Ruf et al. 2021a, Tab. 6.)


Approach                 Configuration         HW Device    Throughput
                                                            (in MDE/s)
Zhao et al. (2020)       CT5x5-SGM             FPGA         9589.9
Zhao et al. (2020)       CT7x7-SGM             FPGA         8743.7
Ruf et al. (2018a)       CT5x5-SGM             FPGA         400.8
Rahn. et al. (2018)      CT5x5-MGM             FPGA         4246.9
Cui et al. (2019)        NCC9x9-optimized      GPU (TX2)    6497.1
Chang et al. (2020)      Z²-ZNCC               GPU (TX2)    1669.2
Hern.-J. et al. (2016)   CT9x7-SGM             GPU (TX1)    747.1

ReS²tAC-CUDA             CT5x5-SGM             GPU (AGX)    652.7
ReS²tAC-CUDA             CT9x7-SGM             GPU (AGX)    644.9
ReS²tAC-CUDA             CT5x5-SGM-fine        GPU (AGX)    640.9
ReS²tAC-CUDA             CT9x7-SGM-fine        GPU (AGX)    633.1
ReS²tAC-CUDA             CT5x5-SGM-exact-cc    GPU (AGX)    365.7
ReS²tAC-CUDA             CT9x7-SGM-exact-cc    GPU (AGX)    361.8
ReS²tAC-CUDA             NCC5x5-SGM            GPU (AGX)    442.4
ReS²tAC-CUDA             NCC9x9-SGM            GPU (AGX)    344.1
ReS²tAC-NEON             CT5x5-SGM             CPU          166.2


Firstly, the measurements reveal the higher computational efficiency of the census transform
as a matching cost function with respect to the normalized cross correlation. Secondly,
they show the small computational overhead of the subpixel disparity refinement (fine), as
discussed in Section 3.4.1.3. However, the measurements also unveil that the presented
optimizations are less efficient than those from the literature. As expected, the implementations
which are optimized and deployed on FPGA architectures are superior to those running on
an embedded GPU. Furthermore, the superiority in terms of throughput of algorithms, such
as those from Cui and Dahnoun (2019) and Chang et al. (2020), that do not rely on a complex
regularization scheme like the SGM, is also to be expected. Nonetheless, ReS²tAC has
a lower throughput than a similar implementation of Hernandez-Juarez et al. (2016), while
simultaneously being deployed on a more powerful system. (Ruf et al. 2021a, Sec. 3.1.4.)


A common way to further increase the throughput of the SGM algorithm is to reduce the
number of aggregation paths. Most of the implementations aggregate the matching costs
for each pixel along eight concentric paths. However, studies (Banz et al. 2010,
Hernandez-Juarez et al. 2016) suggest that a reduction of the aggregation paths from eight to four does
not have a significant negative impact on the accuracy of the resulting disparity map, while
greatly increasing the processing speed. This is also supported by the experiments done in
this work, in which the diagonal aggregation paths of the SGM optimization were omitted
in the regularization of the cost volume, since they are the longest ones. The results of the
conducted experiments are listed in Table 3.7, showing an increase in the throughput by a
factor of up to 1.45, while increasing the error rate by at most 0.2 percentage points and
reducing the density by at most 1.6 percentage points. In this, the approach of Hernandez-Juarez
et al. (2016), which is comparable to the approach of ReS²tAC, is outperformed in terms of both
accuracy and throughput. (Ruf et al. 2021a, Sec. 3.1.4)


Table 3.7: Throughput and accuracy achieved by ReS²tAC with a reduced number of aggregation paths in the SGM
optimization. Instead of eight aggregation paths, only the two horizontal and the two vertical paths were
used. The accuracy in terms of error rate and density was measured on the KITTI 2015 stereo benchmark.
(Ruf et al. 2021a, Tab. 7.)


Approach        Configuration         HW Device    Throughput       D1-all (in %)   Density
                                                   (in MDE/s)       Est.            (in %)
ReS²tAC-CUDA    CT5x5-4-Path-SGM      GPU (AGX)    924.1 (x1.42)    4.3 (+0.0)      88.1 (-0.7)
ReS²tAC-CUDA    CT9x7-4-Path-SGM      GPU (AGX)    904.4 (x1.40)    4.2 (+0.2)      89.2 (-0.8)
ReS²tAC-CUDA    NCC5x5-4-Path-SGM     GPU (AGX)    615.4 (x1.39)    4.3 (+0.0)      88.6 (-1.4)
ReS²tAC-CUDA    NCC9x9-4-Path-SGM     GPU (AGX)    436.5 (x1.27)    4.6 (+0.2)      89.5 (-1.6)
ReS²tAC-NEON    CT5x5-4-Path-SGM      CPU          241.8 (x1.45)    4.8 (+0.2)      89.3 (-0.7)
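To illustrate what the reduction to the two horizontal and two vertical paths means computationally, the following NumPy sketch aggregates a cost volume along these four paths using the standard SGM path recurrence (Hirschmüller 2008). It is an unoptimized illustration and not the CUDA or NEON implementation of ReS²tAC; the function names, the array layout (height, width, disparity) and the penalty parameters p1 and p2 are assumptions.

```python
import numpy as np

def aggregate_along_axis1(cost, p1, p2):
    """Aggregate costs of shape (N, L, D) along axis 1 in increasing index.

    Implements L(p, d) = C(p, d) + min(L(q, d), L(q, d - 1) + p1,
    L(q, d + 1) + p1, min_k L(q, k) + p2) - min_k L(q, k),
    with q being the predecessor of p on the path.
    """
    cost = cost.astype(np.float64)
    agg = np.empty_like(cost)
    agg[:, 0] = cost[:, 0]
    pad = np.full((cost.shape[0], 1), np.inf)
    for i in range(1, cost.shape[1]):
        prev = agg[:, i - 1]                                 # shape (N, D)
        prev_min = prev.min(axis=1, keepdims=True)
        from_lower = np.hstack((pad, prev[:, :-1])) + p1     # from d - 1
        from_upper = np.hstack((prev[:, 1:], pad)) + p1      # from d + 1
        best = np.minimum(np.minimum(prev, prev_min + p2),
                          np.minimum(from_lower, from_upper))
        agg[:, i] = cost[:, i] + best - prev_min
    return agg

def aggregate_four_paths(volume, p1, p2):
    """Sum the path costs of the two horizontal and two vertical paths."""
    left_right = aggregate_along_axis1(volume, p1, p2)
    right_left = aggregate_along_axis1(volume[:, ::-1], p1, p2)[:, ::-1]
    vol_t = volume.transpose(1, 0, 2)                        # (W, H, D)
    top_down = aggregate_along_axis1(vol_t, p1, p2).transpose(1, 0, 2)
    bottom_up = aggregate_along_axis1(vol_t[:, ::-1], p1, p2)[:, ::-1].transpose(1, 0, 2)
    return left_right + right_left + top_down + bottom_up
```

An eight-path variant would add four diagonal traversals of the same recurrence, which is the additional work that the four-path configuration omits.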


With the throughput listed in Tables 3.6 and 3.7, the FPS expected for different
image sizes and disparity ranges are calculated according to Equation (3.4)
and plotted as curves in Figure 3.15. Here, not all configurations of ReS²tAC are plotted.
Instead, one configuration for each cost function and hardware as well as the corresponding
versions with only four paths in the SGM optimization were selected. Additionally, one
configuration that performs a subpixel disparity refinement was selected for comparison.
Furthermore, the FPS curves for the related approaches from literature deployed on FPGA (f) and
GPU (x) architectures are plotted, as well as one curve achieved by ReS²tAC on a high-end
desktop NVIDIA RTX 2070 Super GPU. The image sizes and disparity ranges corresponding
to the KITTI 2015 and Middlebury 2014 Q benchmark are printed in bold. The curves
provide a good visual representation of the throughput listed in Tables 3.6 and 3.7 and show that
with decreasing complexity, i.e. a reduced image size and disparity range, the frame rates
increase rapidly. Thus, the curves reveal how different configurations, approaches and utilized
hardware compare in terms of processing speed. (Ruf et al. 2021a, Sec. 3.1.4)


A key characteristic of embedded systems is their power consumption and, depending on
the platform on which the system is deployed, this can be a crucial characteristic. The simple
metric of FPS per watt (FPS/W) helps to quantify the efficiency of image processing algorithms
with respect to the power consumption of the system on which they are deployed.
The NVIDIA Jetson Xavier AGX, on which ReS²tAC is deployed, offers four different
power settings, namely:
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MAXN This is the setting enabling the maximum performance. With this, all eight cores of 
the ARM CPU are activated and can clock up to a maximum of 2.3 GHz. The maximum 
clock rate of the GPU is set to 1.4 GHz. This is the setting with which all previous 
experiments were conducted. 


30W In this setting, again all eight cores of the CPU are enabled. However, they are re- 
stricted to a maximum clock rate of 1.2 GHz. Furthermore, the clock rate of the GPU 
is restricted to 905 MHz. 


15W In this setting, four cores of the CPU are enabled which clock at a maximum rate of 
1.2 GHz, while the GPU clocks up to 675 MHz. 


10 W In the smallest setting, only two cores of the CPU are enabled with a maximum of 
1.2 GHz and the clock rate of the GPU is restricted to only 522 MHz. 


In Figure 3.16, the FPS/W expected to be achieved by different configurations of ReS²tAC
as well as by some approaches from the literature are plotted, again as a function of image
size and disparity range. These calculations are based on the throughput and power
consumption that were measured or stated. In case of ReS²tAC, the two configurations
reaching the highest throughput on the GPU and the CPU were selected. In these experiments,
the throughput and the power consumption, the latter being provided by internal sensors
of the AGX, were measured under different power settings. Note that the actual power
consumption of the AGX does not coincide with the nominal value of the power setting, as
the latter only indicates an upper bound on the consumption. From the literature, approaches
which state the power consumption in addition to the throughput or frame rate were chosen
for comparison. Unfortunately, among the related work that deploys a stereo algorithm on
embedded GPUs, this was only done by Hernandez-Juarez et al. (2016).
(Ruf et al. 2021a, Sec. 3.1.4.)
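
The efficiency metric described above can be sketched as follows. The function name and the
two sample measurements are hypothetical and only serve to illustrate the calculation; they
are not values from the experiments. As before, it is assumed that the frame rate follows
from dividing the throughput in MDE/s by width x height x disparities.

```python
def fps_per_watt(throughput_mde_s, power_w, width, height, disparities):
    """Efficiency in FPS/W for a given throughput (MDE/s) and average power (W)."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    fps = throughput_mde_s * 1e6 / (width * height * disparities)
    return fps / power_w


# Hypothetical measurements: (throughput in MDE/s, average power in W).
measurements = {"high-power setting": (400.0, 20.0),
                "low-power setting": (300.0, 12.0)}

for setting, (throughput, power) in measurements.items():
    efficiency = fps_per_watt(throughput, power, 640, 480, 64)
    print(f"{setting}: {efficiency:.2f} FPS/W")
```

With these illustrative numbers, the low-power setting is more efficient although its absolute
frame rate is lower, which is the kind of trade-off that Figure 3.16 visualizes.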


The curves in Figures 3.15 and 3.16 clearly reveal the superiority of FPGA-based approaches
over those deployed on GPUs. Not only do they achieve much higher frame rates, they also
require significantly less power and, in turn, achieve higher FPS/W. However, the emerging
embedded GPUs also achieve quite reasonable frame rates with respect to their power
consumption and, depending on where the systems are deployed, e.g. on quadrotor-based
systems, the power required by the GPU is negligible compared to that required by the rotors.
What the results mean for a possible use-case is discussed in more detail in Section 3.5.
Interestingly, the curves in Figure 3.16 show that both the CUDA and the NEON
implementation of ReS²tAC achieve the best efficiency on the 30 W power setting. Even the
CUDA implementation run on the 15 W power setting is still more efficient than the
same implementation run on the setting with maximum performance. It is assumed that this
is the result of clocking down the CPU, since it is less efficient than the GPU. Nonetheless,
the 30 W and 15 W power settings reduce the throughput by approximately 29 % and 42 %,
respectively, compared to that achieved on MAXN. (Ruf et al. 2021a, Sec. 3.1.4.)


[Figure 3.16: line plot. Vertical axis: FPS/W. Horizontal axis: Width x Height x Disparities.
Legend: ReS²tAC-CUDA - CT9x7 - 4-Path-SGM and ReS²tAC-NEON - CT5x5 - 4-Path-SGM, each at the
power settings MAXN, 30W, 15W and 10W; Hernandez-Juarez et al. (2016) - CT9x7 - SGM *;
Rahnama et al. (2018a) - CT5x5 - MGM †; Zhao et al. (2020) - CT5x5 - SGM †;
Ruf et al. (2018a) - CT5x5 - SGM †.]

Figure 3.16: Expected FPS/W for sets of different image sizes and disparity ranges, based on the throughput and the power consumption achieved by different
configurations and approaches. In case of ReS²tAC, the power settings of the AGX were varied in order to measure its efficiency. Image resolution
and disparity ranges corresponding to the KITTI 2015 and Middlebury 2014 Q benchmark are printed in bold. *: Approaches from literature deployed
on embedded GPU hardware. †: Approaches from literature deployed on embedded FPGA hardware. (Ruf et al. 2021a, Fig. 14.)


3.4.2 Qualitative and Quantitative Evaluation of Real-Time Stereo
Processing On-Board Commodity UAVs


As part of use-case-specific experiments, in which the aim is to bring real-time stereo process- 
ing with an embedded CUDA device on board a UAV, a DJI Matrice 210v2 RTK was equipped 
with a DJI Manifold 2-G processing unit (cf. Figure 1.1), which is based on the NVIDIA Jetson 
TX2 architecture containing a 4-core 64-bit ARMv8 CPU and a 256-core Pascal GPU. As a 
stereo camera, the integrated stereo vision sensor was used, which can be accessed by the 
Manifold through the DJI onboard SDK. (Ruf et al. 2021a, Sec. 3.2.) 


In the maximum power setting (MAXN), the ARM Cortex A57 CPU of the TX2 inside the
Manifold has four cores with a maximum clock rate of 2 GHz, while the built-in Tegra GPU
clocks up to a maximum of 1.3 GHz. The integrated vision sensor provides non-rectified,
grayscale stereo image pairs with an image resolution of up to 640 x 480 pixels at a frame
rate of 20 FPS. To calibrate the stereo sensor and, in turn, precompute the rectification maps
needed to transform the input images into a rectified stereo pair prior to the actual stereo
processing, the standard calibration routine of OpenCV was used. The integrated stereo
vision sensor also provides precomputed disparity maps at a frame rate of 10 FPS and with an
image resolution of 320 x 240 pixels (DJI 2021). (Ruf et al. 2021a, Sec. 3.2.)


Figure 3.17: Qualitative results of ReS²tAC run on the DJI Manifold 2-G mounted on the DJI Matrice 210v2 RTK,
using the data of the stereo vision sensor as input. In this, the UAV was flying 1-2 m above the ground in
order to demonstrate the ability of ReS²tAC to appropriately estimate the scene depth. Top row:
Rectified reference images. Bottom row: Corresponding disparity maps, color-coded with the jet color
map, going from red (0.9 m), over yellow and green to blue (57 m). (Ruf et al. 2021a, Fig. 16.)


Table 3.8: Throughput and frame rates achieved by two configurations of ReS²tAC on the DJI Manifold equipped
with a Jetson TX2. (Ruf et al. 2021a, Tab. 8.)

Approach       Configuration        HW Device   Throughput   Frame Rate (in FPS)
                                                (in MDE/s)   640x480x64px   320x240x64px
ReS²tAC-CUDA   CT9x7 - 4-Path-SGM   GPU (TX2)   304.7        15.5           62.0
ReS²tAC-NEON   CT5x5 - 4-Path-SGM   CPU         102.2        5.2            20.8


In the scope of this work, two configurations of ReS²tAC were deployed and tested, one
running on the GPU and the other on the CPU, namely ReS²tAC-CUDA with CT9x7 - 4-Path-SGM
and ReS²tAC-NEON with CT5x5 - 4-Path-SGM. Both configurations rely on the
Hamming distance of the census transform as a cost function and use only four paths in the
SGM optimization in order to reach a higher throughput. The throughput and frame rates
as well as qualitative results achieved by the two configurations are listed in Table 3.8 and
shown in Figure 3.17, respectively. (Ruf et al. 2021a, Sec. 3.2.)
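
The cost function mentioned above can be illustrated with a minimal, unoptimized sketch in
plain Python. It is not the CUDA or NEON implementation of ReS²tAC; the function names, the
3 x 3 window and the tiny test images are assumptions chosen for readability. Each pixel is
encoded by a bit string that records which neighbors are darker than the center, and two
pixels are compared by the Hamming distance of their bit strings.

```python
def census_transform(image, win_h, win_w):
    """Census bit string (as int) per pixel; border pixels without a full window stay 0."""
    rows, cols = len(image), len(image[0])
    rh, rw = win_h // 2, win_w // 2
    census = [[0] * cols for _ in range(rows)]
    for y in range(rh, rows - rh):
        for x in range(rw, cols - rw):
            center = image[y][x]
            bits = 0
            for dy in range(-rh, rh + 1):
                for dx in range(-rw, rw + 1):
                    if dy == 0 and dx == 0:
                        continue  # the center pixel is not compared with itself
                    bits = (bits << 1) | (1 if image[y + dy][x + dx] < center else 0)
            census[y][x] = bits
    return census


def hamming_cost(a, b):
    """Number of differing bits between two census bit strings."""
    return bin(a ^ b).count("1")


# Toy example: the left image equals the right image shifted by one pixel (disparity 1).
right = [[10, 20, 30, 40, 50, 60],
         [60, 10, 50, 20, 40, 30],
         [5, 25, 15, 35, 45, 55]]
left = [[row[0]] + row[:-1] for row in right]

census_left = census_transform(left, 3, 3)
census_right = census_transform(right, 3, 3)

x = 3
for d in (0, 1):
    cost = hamming_cost(census_left[1][x], census_right[1][x - d])
    print(f"disparity {d}: cost {cost}")
```

In this toy example, the cost vanishes at the true disparity of one pixel and is large at
disparity zero, which is the behavior the subsequent SGM optimization builds upon.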


3.5 Discussion 


When assessing the usability of embedded stereo algorithms for the deployment and usage in 
actual applications, the accuracy, processing speed as well as power consumption are crucial 
characteristics. Thus, in the following, the experimental results are discussed with respect to 
each of these aspects, namely accuracy (Section 3.5.1), processing speed (Section 3.5.2) and 
power consumption (Section 3.5.3). 


3.5.1 Accuracy 


The quantitative evaluation on the KITTI 2015 stereo benchmark (Table 3.2) reveals a high,
state-of-the-art accuracy of ReS²tAC, both in the actual estimated disparity maps (Est),
in which inconsistent regions are removed, and in the interpolated versions (All). The
latter are used as part of the standard evaluation routine of the benchmark. Understandably,
the accuracy of the interpolated disparity maps is used as the primary ranking in the
benchmark, since it allows disparity maps with different densities to be compared. In
the assessment of the performance of ReS²tAC, however, the accuracies of the filtered
disparity maps are of greater importance. In particular, when looking at the qualitative
results in Figure 3.13 and the stated densities of the estimated disparity maps, it becomes
clear that the few areas that are removed by the post-filtering mostly lie in occluded regions
and are thus legitimately removed, since it is not possible for the algorithm to reason about
the depth in areas which are only seen by one camera. (Ruf et al. 2021a, Sec. 4.1.)


Unfortunately, the results on the Middlebury 2014 stereo benchmark place the accuracy of
ReS²tAC far from the state-of-the-art. With an error rate of over 35 % for three out of four
accuracy levels, the results are not satisfying. Apart from the strict accuracy levels in
the evaluation and the high-resolution ground truth, it is assumed that these poor results can
also be attributed to the fact that the resolution at which the disparity maps are
evaluated is 4x larger than the input resolution, which introduces a lot of error in the
process of upscaling and interpolation. Thus, the results were also evaluated on ground truth
data with only a quarter of the original image resolution. This reveals that the error rate is
on average reduced by well over 50 %. For the configuration with a 5 x 5 census transform,
being the best-performing configuration on this dataset, the resulting error rates are
35.8 % (bad0.5), 14.2 % (bad1), 7.4 % (bad2) and 4.9 % (bad4), which is in a range comparable
to that obtained on the KITTI benchmark. (Ruf et al. 2021a, Sec. 4.1.)
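
For readers unfamiliar with the error levels named above, the following sketch shows how such
a bad-X error rate can be computed. It assumes the common definition, in which bad-X is the
percentage of evaluated pixels whose absolute disparity error exceeds X pixels. The function
name, the flat-list input and the handling of missing values (pixels without an estimate or
without ground truth are skipped) are simplifying assumptions and do not reproduce the full
evaluation protocol of the benchmark.

```python
def bad_pixel_rate(estimate, ground_truth, threshold):
    """Percentage of evaluated pixels with an absolute disparity error above threshold.

    Pixels for which either value is None are skipped (simplifying assumption).
    """
    evaluated = 0
    bad = 0
    for est, gt in zip(estimate, ground_truth):
        if est is None or gt is None:
            continue
        evaluated += 1
        if abs(est - gt) > threshold:
            bad += 1
    if evaluated == 0:
        raise ValueError("no pixel could be evaluated")
    return 100.0 * bad / evaluated


# Toy data: three valid pixels with errors of 0.2, 1.4 and 2.5 px, one filtered pixel.
estimate = [10.0, 12.4, 7.0, None]
ground_truth = [10.2, 11.0, 9.5, 3.0]

for level in (0.5, 1.0, 2.0, 4.0):
    rate = bad_pixel_rate(estimate, ground_truth, level)
    print(f"bad{level:g}: {rate:.1f} %")
```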


As the name suggests, ReS²tAC is intended for real-time, rather than high-accuracy, stereo
processing. The main use-case of this work is the deployment on COTS UAVs equipped with
embedded ARM or CUDA hardware, with the purpose of obstacle detection and collision
avoidance. In the light of this use-case, the KITTI benchmark is more appropriate than the
Middlebury benchmark, since it comprises image data that depict real-world scenes in a
quality that can also be expected from cameras mounted on UAVs. Moreover, for collision
avoidance, it is not necessary to have disparity maps with high subpixel accuracy as evaluated
by the Middlebury benchmark. It is more important to reliably detect the location of objects
in the perceived scene and extensively reconstruct their appearance. Not only do the
quantitative results confirm the high accuracy of the disparity maps, the qualitative
presentation of some of the disparity maps also shows that ReS²tAC is able to reveal objects
which are only visible at second glance, such as the person on the right of Figure 3.13
(Row 2, Column 2) or Figure 3.17 (Column 2). (Ruf et al. 2021a, Sec. 4.1.)


With respect to the KITTI 2015 benchmark, most of the configurations of ReS²tAC outperform
the other approaches that perform real-time stereo estimation on embedded hardware.
Even compared to the baseline algorithms, the results are comparable, especially when
considering the frame rates achieved. Furthermore, the conducted experiments on the effects
of subpixel disparity refinement (cf. Tables 3.4 and 3.5) show that its use can increase the
accuracy by up to 35 % without significantly decreasing the throughput (cf. Table 3.6). (Ruf
et al. 2021a, Sec. 4.1.)


3.5.2 Processing Speed


Compared to the related work, the throughput and processing speed of ReS²tAC are not
very impressive. It is to be expected that ReS²tAC reaches a significantly lower throughput
than approaches running on FPGAs (Zhao et al. 2020, Rahnama et al. 2018a), as well as a
slightly lower throughput than approaches that do not rely on a computationally expensive
optimization scheme but are run on an embedded GPU (Cui and Dahnoun 2019, Chang et al.
2020). However, the CT9x7 - SGM configuration of ReS²tAC has a lower throughput than that
of a comparable configuration by Hernandez-Juarez et al. (2016), while at the same time
running on hardware that is two generations newer and has twice the number of CUDA cores.
This is not satisfactory and will need to be further investigated in the scope of future work.
One difference between the work of Hernandez-Juarez et al. (2016) and this one is that the
time measurements conducted in the scope of this work include the data transfer to and from
the memory of the GPU. Hernandez-Juarez et al. (2016) argue that the data transfer can be
overlapped with the computation and is thus not relevant for the computation of the
throughput. However, in the case of this work, the data transfer only takes up 4-6 % of the
processing time, so omitting it would not be enough to reach the throughput of
Hernandez-Juarez et al. (2016). Other optimization steps are the utilization of SIMD
instructions to vectorize the cost aggregation within the CUDA kernels, or the streamlining
of the aggregation of the last path and the disparity computation to reduce memory access,
which should lead to a 1.35x performance speed-up (Hernandez-Juarez et al. 2016). A third and
significant difference between ReS²tAC and the approach of Hernandez-Juarez et al. (2016)
with respect to performance is that Hernandez-Juarez et al. (2016) refrain from any
post-processing such as a left-right consistency check or a median filter. This allows a
higher throughput to be reached but, in turn, reduces the accuracy of the results, leading to
an error rate which is twice as high as that of ReS²tAC in terms of the actual estimates
(cf. Table 3.2). (Ruf et al. 2021a, Sec. 4.2.)
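
The argument on the data transfer can be made concrete with a short calculation. If a fraction
f of the processing time is spent on data transfer, then removing it entirely increases the
throughput by at most a factor of 1 / (1 - f). The sketch below, with a hypothetical function
name, evaluates this bound for the stated range of 4-6 %.

```python
def max_speedup_without_transfer(transfer_fraction):
    """Upper bound on the speed-up if the data transfer time were removed entirely."""
    if not 0.0 <= transfer_fraction < 1.0:
        raise ValueError("transfer_fraction must be in [0, 1)")
    return 1.0 / (1.0 - transfer_fraction)


for fraction in (0.04, 0.06):
    speedup = max_speedup_without_transfer(fraction)
    print(f"transfer share {fraction:.0%}: speed-up of at most {speedup:.3f}x")
```

Even at a share of 6 %, the bound stays below 1.07x, which is well below the 1.35x speed-up
attributed to the other optimization steps.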


Furthermore, in the light of the use-case, which addresses on-board stereo processing on COTS
UAVs, the throughput reached by ReS²tAC is sufficient, as another limiting factor
is the camera sensor and the data throughput between the sensor and the processing board that
is provided by commodity systems. Since an FPGA is typically located closer to the sensor
with a direct and high-bandwidth connection, which allows the image data to be streamed
directly into the memory of the FPGA, a high data throughput is of greater importance there.
However, embedded systems equipped with a GPU, like the NVIDIA Jetson series, that are
mounted on COTS UAVs are usually connected to the sensor via USB or similar, with the CPU
capturing the data and storing it in global memory, from where it gets transferred to the
device memory of the GPU before it can be processed. This process does not allow for a high
input frame rate (usually between 20 FPS and 30 FPS), as can be seen in the example of the
DJI Matrice (cf. Section 3.4.2), and therefore does not require an extremely high throughput.
Nonetheless, the more time is spent on the estimation of the disparity map, the less time is
available for the subsequent interpretation, e.g. obstacle detection and avoidance. This
motivates a further investigation into the optimization of the processing speed in the
future. (Ruf et al. 2021a, Sec. 4.2.)
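
The trade-off between disparity estimation and subsequent interpretation can be illustrated
with a simple time budget. The sketch assumes strictly sequential processing of each frame,
uses the sensor frame rate of 20 FPS and takes the processing rates of the CUDA-based
configuration from Table 3.8; the function name is hypothetical.

```python
def remaining_budget_ms(input_fps, stereo_fps):
    """Time per frame left for interpretation after disparity estimation, in ms.

    A negative value means that disparity estimation alone cannot keep up
    with the input frame rate.
    """
    frame_period_ms = 1000.0 / input_fps
    stereo_time_ms = 1000.0 / stereo_fps
    return frame_period_ms - stereo_time_ms


SENSOR_FPS = 20.0
# Frame rates of ReS²tAC-CUDA on the Jetson TX2 as listed in Table 3.8.
for resolution, stereo_fps in [("640x480x64", 15.5), ("320x240x64", 62.0)]:
    budget = remaining_budget_ms(SENSOR_FPS, stereo_fps)
    print(f"{resolution}: {budget:+.1f} ms per frame")
```

Under these assumptions, the reduced resolution leaves roughly two thirds of the frame period
for obstacle detection, whereas at full resolution the disparity estimation on the TX2 already
exceeds the frame period of the sensor.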


3.5.3 Power Consumption 


The final aspect to be discussed is the power consumption of ReS²tAC running on the
NVIDIA Jetson Xavier AGX in the light of deployment on a rotor-based UAV, and whether
it is feasible to use embedded CPUs for stereo processing. The average power consumption
of the complete NVIDIA Jetson Xavier AGX (i.e. including the GPU, the CPU and the EMC)
under the maximum power setting MAXN during the execution of the CUDA- and NEON-based
implementations is approximately 20.1 W and 17.9 W, respectively. The DJI Matrice
210v2 RTK is powered by two batteries with a total energy of 349.2 Wh, which allows a
maximum flight time of 33 minutes when no payload is attached (DJI 2020a). Thus, during
flight, the bare DJI Matrice 210v2 RTK draws an average power of around 634.9 W. This power
consumption obviously increases with each gram of payload that is attached. Given
these measurements, it can be calculated that the power consumption of the AGX under
the highest power setting only amounts to 3.2 % and 2.8 % of the power consumption of the
DJI Matrice when the CUDA- and NEON-based implementations are executed, respectively,
thus reducing the flight time by a maximum of 1 minute. This is an
upper bound on the relative power consumption of the AGX, as the power consumption
of the DJI during flight increases when payload is attached. The use of other power settings,
which will increase the FPS/W ratio (cf. Figure 3.16) but also decrease the absolute frame
rate, depends on the use-case and on whether the gain of a few seconds in flight time is more
valuable than a major reduction in frame rate. (Ruf et al. 2021a, Sec. 4.3.)
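
The figures above can be reproduced with a short calculation. The sketch assumes a constant
power draw during flight and that the AGX is supplied from the batteries of the UAV; the
function names are hypothetical, while the numbers are those stated in the text.

```python
BATTERY_ENERGY_WH = 349.2    # total energy of both batteries
MAX_FLIGHT_TIME_MIN = 33.0   # maximum flight time without payload


def average_flight_power_w(energy_wh, flight_time_min):
    """Average power draw in W if the energy is used up within the flight time."""
    return energy_wh / (flight_time_min / 60.0)


def flight_time_min(energy_wh, power_w):
    """Flight time in minutes for a constant power draw."""
    return energy_wh / power_w * 60.0


uav_power = average_flight_power_w(BATTERY_ENERGY_WH, MAX_FLIGHT_TIME_MIN)
print(f"average power of the bare UAV: {uav_power:.1f} W")

for name, agx_power in [("CUDA", 20.1), ("NEON", 17.9)]:
    share = 100.0 * agx_power / uav_power
    reduced = flight_time_min(BATTERY_ENERGY_WH, uav_power + agx_power)
    loss = MAX_FLIGHT_TIME_MIN - reduced
    print(f"{name}: {share:.1f} % of the UAV power, flight time reduced by {loss:.2f} min")
```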


The power consumption during the execution of the presented CUDA-based implementation
is higher than during the execution of the NEON-based implementation. This is expected,
as the GPU, which consumes more power than the CPU, is clocked down when not in use.
However, the reduction in power consumption is not in proportion to the loss in processing
speed of the NEON-based implementation compared to the ones based on CUDA. This is
clearly revealed by the curves in Figure 3.16, which show that the NEON-based implementations
running on the CPU have the worst FPS/W ratio. Thus, it can be concluded that the use of
embedded GPUs is preferred over embedded CPUs. However, some drones, e.g. VOXL*, are
only equipped with an embedded ARM CPU, for which a vectorized stereo processing with
NEON intrinsics is an option. (Ruf et al. 2021a, Sec. 4.3.)


* https://www.modalai.com/pages/voxl


3.6 Conclusion


In conclusion, with ReS²tAC, this work presents an approach for real-time stereo processing
on embedded ARM and CUDA devices, such as those attached to modern COTS UAVs.
In this, a disparity estimation algorithm, which is based on the SGM approach (Hirschmüller
2005, Hirschmüller 2008), is optimized for embedded CUDA GPUs, such as the NVIDIA Tegra,
by general-purpose computation on a GPU, as well as for embedded ARM CPUs by utilizing
the NEON intrinsics for vectorized SIMD processing. It is demonstrated that ReS²tAC
reaches state-of-the-art accuracies when evaluated on public stereo benchmark datasets. The
CUDA-based implementation for stereo processing on embedded GPUs reaches real-time
performance, even though it does not outperform related work in terms of processing speed.
The frame rates of the NEON-based implementation, however, outperform all related work
on stereo processing on embedded CPUs. In a use-case-specific scenario, the suitability of
ReS²tAC is demonstrated for real-time stereo estimation on a COTS UAV, namely the DJI
Matrice 210v2 RTK equipped with a DJI Manifold 2-G. (Ruf et al. 2021a, Sec. 5.)


Furthermore, in the scope of this work, it is demonstrated that in case of rotor-based UAVs
a modern embedded GPU is a suitable alternative to an embedded FPGA, especially due to
its shorter and thus less expensive development cycles. Even though the GPU has a much
greater power consumption than an FPGA and a significantly worse FPS/W ratio, its power
consumption is negligible compared to the power needed by rotor-based UAVs during flight
and will reduce the flight time of the DJI Matrice 210v2 RTK by a maximum of 1 minute.
However, for embedded systems with stricter power constraints, an FPGA-based approach
should be considered. The experiments have also shown that, although the CPU requires less
power than the GPU, it has the worst FPS/W ratio. Thus, the optimization based on NEON
intrinsics for vectorized SIMD processing should only be used if neither a GPU nor an FPGA
is available. (Ruf et al. 2021a, Sec. 5.)


Finally, the experiments have revealed that further investigation is needed in order to
identify which part of the CUDA-based implementation needs to be further optimized, since
ReS²tAC does not reach the processing speeds of comparable approaches from the literature.
Furthermore, in the future, other approaches, e.g. deep-learning-based algorithms that are
capable of reaching higher accuracies than ReS²tAC, should also be considered. However,
especially in case of approaches that are based on deep learning, experiments with respect to
the generalization to unknown scenes and the deployment on commodity hardware need to be
conducted. Studies on the reliability of the predictions produced by deep-learning-based
approaches are particularly important when considering safety-critical
applications. (Ruf et al. 2021a, Sec. 5.)


As discussed, ReS²tAC primarily aims at facilitating the perception of the environment of
a COTS UAV or other vehicles and, in turn, at allowing a safe and autonomous use by using the
estimated disparity or depth maps to perform obstacle detection and collision avoidance. A
combination of ReS²tAC with a simple yet effective algorithm for the task of obstacle
detection and collision avoidance is presented and discussed in Section 6.1. Even though, in
the scope of this work, the use of ReS²tAC is portrayed only with respect to a single
use-case, there are many other possible applications, such as RGB-D odometry or rapid 3D
mapping, that can be facilitated by a real-time disparity estimation algorithm.


4 Fast Multi-View Stereo Depth Estimation
from Monocular Video Data


This chapter includes material from the following publication: 


Ruf, B.; Weinmann, M., and Hinz, S. (2021b): “FaSS-MVS - Fast multi-view stereo with surface-aware semi- 
global matching from UAV-borne monocular imagery”. In: arXiv preprint arXiv:2112.00821v1. Reprinted with 
permission. It is cited as (Ruf et al. 2021b) and marked with a green sidebar. 


In the second part of the proposed framework (cf. Section 1.2), the visual payload camera
is used to facilitate two high-level applications, namely rapid 3D mapping and 3D change
detection. These applications, in turn, aim at supporting first responders during disaster
relief or search-and-rescue (SAR) missions, by allowing a quick and large-scale assessment
of the situation or enabling the monitoring of areas which are inaccessible to ground forces
(Restas 2015, Furutani and Minami 2021). In this, image-based techniques and photogrammetry
based on aerial reconnaissance are a key element in supporting the rescue workers,
provided that the environmental conditions, e.g. weather and time of day, allow for a visual
inspection (Furutani and Minami 2021). There exists a large collection of software toolboxes,
such as COLMAP (Schönberger and Frahm 2016, Schönberger et al. 2016), for photogrammetric
3D reconstruction that allow the disaster site to be accurately reconstructed
from aerial imagery. Their focus, however, is primarily on offline and accurate processing.
This hinders their use for rapid 3D mapping during the image acquisition, due to the run-time
required for highly accurate 3D modeling and the consideration of all input images for
processing. However, in order to efficiently aid first responders in their tasks, the run-time
of the algorithms matters, which, in turn, raises the need for efficient, fast and incremental
3D mapping in order to support a rapid assessment. Here, the availability of 3D data makes it
possible, for example, to reason about damage caused by an incident or the structural
integrity of a partly collapsed building, to plan routes through areas which are difficult to
access, or to account for 3D geometry when creating an orthographic map.
(Ruf et al. 2021b, Sec. 1.)


4.1 Contributions and Outline


To accommodate the above-mentioned applications, this chapter presents a novel approach
for fast multi-view stereo with surface-aware Semi-Global Matching, denoted as FaSS-MVS.
The approach

• uses plane-sweep sampling to perform hierarchical dense multi-image matching,

• utilizes and extends the widely used SGM algorithm (Hirschmüller 2005, Hirschmüller
  2008) to favor not only fronto-parallel surfaces in the computation of dense depth
  maps, by incorporating a surface-aware regularization based on local surface normals,

• efficiently computes dense depth, normal and confidence maps from image sequences,
  thereby facilitating the task of incremental UAV-borne 3D mapping, and

• is quantitatively evaluated on two public datasets for dense multi-view stereo (MVS)
  with accurate ground truth and, additionally, results are presented for two
  use-case-specific datasets.

Even though FaSS-MVS is proposed with the discussed use-case in mind, it is not restricted
to airborne data and can also be used to perform an incremental and online 3D
mapping of an environment captured by a ground-based robot or sensor system. (Ruf et
al. 2021b, Sec. 1.)
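
To illustrate why the standard SGM formulation favors fronto-parallel surfaces, the following
sketch implements the well-known path-wise cost aggregation of SGM along a single path in
plain Python. It shows the standard recursion only and does not include the surface-aware
regularization of FaSS-MVS; the function name, the penalties and the toy cost values are
assumptions. Keeping the same label between neighboring pixels is free of charge, a change by
one label is penalized with P1 and any larger change with P2.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one path with the standard SGM recursion.

    costs[x][d] is the matching cost of label d at position x along the path.
    """
    num_labels = len(costs[0])
    aggregated = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(num_labels):
            # Same label: no penalty. Arbitrary label change: penalty p2.
            candidates = [prev[d], prev_min + p2]
            # Change by one label: penalty p1.
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d + 1 < num_labels:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the aggregated costs bounded.
            row.append(costs[x][d] + min(candidates) - prev_min)
        aggregated.append(row)
    return aggregated


# Toy example: the middle pixel is ambiguous, its neighbors clearly prefer label 1.
costs = [[9, 0, 9],
         [5, 5, 5],
         [9, 0, 9]]
for row in aggregate_path(costs, p1=1, p2=4):
    print(row)
```

In the toy example, the ambiguous middle pixel receives its lowest aggregated cost for the
label of its predecessor, i.e. a constant label along the path is preferred. For disparity or
depth labels, this corresponds to a fronto-parallel surface.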


The contributions presented in this chapter have partly been published in:

• Ruf, B.; Weinmann, M., and Hinz, S. (2021b): “FaSS-MVS - Fast multi-view stereo with
  surface-aware semi-global matching from UAV-borne monocular imagery”. In: arXiv
  preprint arXiv:2112.00821v1. Published as preprint. Cited as (Ruf et al. 2021b).

• Ruf, B.; Pollok, T., and Weinmann, M. (2019): “Efficient surface-aware semi-global
  matching with multi-view plane-sweep sampling”. In: ISPRS Annals of the Photogrammetry,
  Remote Sensing and Spatial Information Sciences IV-2/W7, pp. 137-144.
  Peer-reviewed on the basis of the full paper. Cited as (Ruf et al. 2019).

• Ruf, B.; Thiel, L., and Weinmann, M. (2018b): “Deep cross-domain building extraction
  for selective depth estimation from oblique aerial imagery”. In: ISPRS Annals of the
  Photogrammetry, Remote Sensing and Spatial Information Sciences IV-1, pp. 125-132.
  Peer-reviewed on the basis of the full paper. Cited as (Ruf et al. 2018b).

• Ruf, B.; Erdnüß, B., and Weinmann, M. (2017): “Determining plane-sweep sampling
  points in image space using the cross-ratio for image-based depth estimation”. In:
  International Archives of the Photogrammetry, Remote Sensing and Spatial Information
  Sciences XLII-2/W6, pp. 325-332. Cited as (Ruf et al. 2017).


This chapter is structured as follows: In Section 4.2, the related work on incremental
image-based 3D mapping for online processing as well as modern approaches for learning-based
dense image matching (DIM) and MVS are briefly summarized. It is also delineated
how the presented approach differs from those found in the related work. In Section 4.3, the
overall processing pipeline of the presented approach is illustrated and outlined in a brief
overview. This is followed by a detailed description of the implementation and methodology
of the individual steps of the processing pipeline. The approach is quantitatively and
qualitatively evaluated on two public and two private datasets. The datasets, the error
metrics as well as the results of the conducted experiments are presented in Section 4.4.
Subsequently, the findings are discussed and put into the context of the considered use-case
in Section 4.5, before a summary and concluding remarks as well as a short outlook on future
work are provided in Section 4.6. (Ruf et al. 2021b, Sec. 1.1.)


4.2 Related Work 


Due to the ever-increasing demand for detailed 3D models, the research in the fields of
photogrammetry, remote sensing and computer vision has brought forth a number of software
suites and applications that focus on the estimation of accurate and dense depth and
geometry information from a large set of input images, by means of DIM and MVS. Prominent
and widely used representatives of such applications are MVE (Goesele et al. 2007), PMVS
(Furukawa and Ponce 2010), SURE (Rothermel et al. 2012, Wenzel et al. 2013b), COLMAP
(Schönberger et al. 2016) and OpenMVS (OpenMVS 2022), to name a few. These approaches,
however, are designed for offline processing, aiming at the accuracy and completeness of
the resulting 3D model, while assuming that all input data is available at the time of
reconstruction and that no critical constraints on the computation time or hardware
resources are set. In contrast, the aim of FaSS-MVS is to extract dense depth and geometry
information from image sequences while they are acquired, or at least while the image data
stream is received, for example, in case a direct processing is not possible due to the
acquisition by a small UAV and its limited hardware resources. Thus, the focus lies on the
incremental and online processing of the input data by DIM and MVS.
(Ruf et al. 2021b, Sec. 1.2.)


4.2.1 Incremental Camera-Based Mapping for Online Processing 


Early work on incremental and online camera-based mapping of the local environment was
mainly driven by robotic and augmented reality (AR) applications (Klein and Murray 2007,
Davison et al. 2007, Eade and Drummond 2006). Here, the main goal was to robustly localize
the camera, and in turn the sensor carrier, with respect to its surroundings, in order to
navigate through the environment or enhance the camera images with additional information.
Since the focus of these so-called simultaneous localization and mapping (SLAM) algorithms
is the estimation of the camera pose and trajectory, the detailed and dense mapping of
the environment was of secondary interest. Thus, these approaches mainly relied on
point features for the tracking and mapping rather than direct pixel matching. However, in
order to provide a convincing AR experience, a dense and detailed model of the environment
is essential. Subsequent works (Newcombe and Davison 2010, Newcombe et al. 2011) have
proposed a dense mapping simultaneous to the acquisition of the image data and the
localization of the camera, resulting in a detailed reconstruction of a small AR workspace.
Since these approaches, however, aim to reconstruct rather small-scale environments, they
make use of short-baseline video clips for the image matching, which in turn allows them to
rely on dense optical flow methods to find dense pixel correspondences (Newcombe and Davison
2010). In contrast, the input to FaSS-MVS is assumed to be image data captured by a UAV, which
is typically flying several tens of meters away from the object of interest. Thus, FaSS-MVS
rather aims to densely map a large-scale environment, which in turn hinders the tracking of
pixel-wise correspondences between consecutive frames and requires a wide-baseline image
matching instead. However, FaSS-MVS is not solely restricted to large-scale environments
and wide-baseline image matching, as experiments with respect to the employed multi-image
matching, done in other work (Ruf et al. 2017), show. (Ruf et al. 2021b, Sec. 1.2.1.)


Early work on camera-based mapping and reconstruction of urban surroundings was done by 
Gallup et al. (2007) and Pollefeys et al. (2008), who employed the plane-sweep algorithm for 
true multi-image matching (cf. Section 2.1.3) to map and reconstruct building facades from 
images captured by a vehicle-mounted camera in real-time. In this, they rely on vanishing 
points, which are detected in the input images, and on data from an additional IMU to re- 
cover the orientations of the building facades and the ground plane relative to the camera. 
To find the optimal plane configuration for each pixel and, in turn, extract a depth map from 
the results of the DIM, Pollefeys et al. (2008) employ a Bayesian formulation with a subse- 
quent selection of the WTA solution, while Gallup et al. (2007) minimize a formulated energy 
functional. In (Pollefeys et al. 2008), the estimation of the camera poses is done by using a 
Kanade-Lucas-Tomasi (KLT) feature tracker. A big advantage in the mapping and reconstruc- 
tion of urban areas is that most objects in such scenery can be approximated well by planar 
structures, which is why plane-sweep DIM is well-suited for this task. Other approaches 
for urban reconstruction from ground-based imagery, like those from Furukawa et al. (2009), 
Sinha et al. (2009) and Gallup et al. (2010), perform a piece-wise planar reconstruction by 
fitting multiple, differently oriented planes into the scene and optimizing photometric con- 
sistency. In this, they minimize an energy functional by a graph-cut algorithm, which takes a 
couple of minutes on a commodity CPU. Algorithms having a couple of minutes run-time to 
estimate a single depth map do not seem to be suitable for fast and online processing at first
sight. However, depending on their ability to be parallelized and optimized for the execution
on a GPU, they might be useful after all. (Ruf et al. 2021b, Sec. 1.2.1.) 


Around the same time as the previously mentioned work was released, Hirschmüller (2005) 
and Hirschmüller (2008) proposed the so-called SGM algorithm (cf. Section 2.2.1), which 
evolved into one of the most widely used approaches for both online and offline DIM, due
to its efficiency and convincing results. It has been deployed on both desktop (Spangenberg
et al. 2014, Banz et al. 2011) and embedded (Hernandez-Juarez et al. 2016, Zhao et al. 2020, 
Ruf et al. 2021a) hardware as already discussed in Section 3.2. Moreover, the SGM algorithm 
is used in a wide range of applications, such as advanced driver assistance systems (ADAS)
(Spangenberg et al. 2014), real-time obstacle detection and collision avoidance on-board UAVs 
(Barry et al. 2015, Ruf et al. 2018a) and urban mapping and reconstruction from aerial imagery 
(Rothermel et al. 2012, Wenzel et al. 2013b, Haala et al. 2015). In their work, Sinha et al. (2014) 
combine the plane-sweep multi-image matching with the SGM algorithm to estimate dense 
and highly accurate disparity maps. In contrast to the presented approach, Sinha et al. (2014) 
use local slanted planes, which are extracted from feature correspondences, to create disparity
hypotheses and employ the SGM algorithm to recover a disparity map. They evaluate
their approach on a high-resolution stereo benchmark and achieve significant improvement
over the standard SGM algorithm in both run-time and accuracy. The improvement in terms
of run-time is attributed to the fact that the local plane-sweep only needs to test a locally
confined part of the complete disparity range for each pixel, thus reducing the computational
complexity of the optimization within the SGM algorithm. Similar improvements to overcome
the problem of high computational complexity due to a large disparity range, which is
inherent to oblique aerial imagery, were made by Haala et al. (2015), by embedding the SGM
algorithm into a hierarchical coarse-to-fine processing. (Ruf et al. 2021b, Sec. 1.2.1.)


Even though a large number of urban environments can be well abstracted by piecewise pla- 
nar reconstructions, not all structures are fronto-parallel, meaning that their surface orienta- 
tions are not parallel to the image plane. In order to account for slanted surfaces, Kuschk and 
Cremers (2013), for example, have incorporated a second-order smoothness assumption into 
their energy function. The initial formulation of the SGM algorithm, however, only models a 
first-order smoothness term and thus favors fronto-parallel surfaces, leading to stair-casing 
artifacts when reconstructing slanted surfaces. Especially when aiming for a visually appeal- 
ing reconstruction of the environment, this is to be avoided. While Hermann et al. (2009) 
and Ni et al. (2018) propose to incorporate a second-order smoothness assumption into the 
formulation of the SGM energy function, Scharstein et al. (2017) propose a more simplistic 
and yet effective improvement to address this issue. More specifically, plane priors are used, 
which, for example, can be recovered from normal maps or point correspondences, to adjust 
the zero-cost transition within the path aggregation of the SGM, thus penalizing deviations 
from the surface orientation represented by the prior. The major advantage over the other
approaches is that the pixel-wise offset for the zero-cost transition can be calculated in advance
and is in its magnitude the same for opposite aggregation paths, making its use very
efficient. (Ruf et al. 2021b, Sec. 1.2.1.) 


4.2.2 Learning of Dense Image Matching and Multi-View Stereo 
Reconstruction 


With the advancements and success of deep-learning-based methods in other topics of computer
vision, remote sensing and photogrammetry, such as object detection, classification or
image segmentation, it was just a matter of time until the first learning-based approaches for
the task of DIM and MVS that would outperform state-of-the-art model-based approaches
were presented. Early works (Han et al. 2015, Zbontar and LeCun 2016, Hartmann
et al. 2017) use deep CNNs to learn similarity measures between image patches and, in turn, 
build up a 3D cost volume from which a disparity or depth map is extracted by conventional 
methods, e.g. SGM (Hirschmüller 2005, Hirschmüller 2008). (Ruf et al. 2021b, Sec. 1.2.2.) 


Early approaches to perform actual MVS with deep learning are the so-called MVSNet (Yao 
et al. 2018) and DeepMVS (Huang et al. 2018). Both use the plane-sweep algorithm to match 
the pixels of multiple input images based on learned features and similarity measures and to 
build up a cost volume, just like the conventional approaches. To regularize the computed 
cost volume and to extract the depth map, both use a 3D U-Net (Ronneberger et al. 2015). 
3D CNNs, such as the 3D U-Net, consume a large amount of memory and are computationally not
very efficient, which is why other approaches, such as the ones presented in (Yao et al. 2019,
Yan et al. 2020), replace the 3D U-Net with a cascade of 2D CNNs. Further approaches (Cheng
et al. 2020, Gu et al. 2020) remedy the high memory consumption by the use of hierarchical 
coarse-to-fine processing, as also done by FaSS-MVS. In the construction of the cost volume, 
there also exist other strategies, such as using gated convolution (Yi et al. 2020) or reprojecting 
the image data into a 3D voxel grid (Ji et al. 2017). (Ruf et al. 2021b, Sec. 1.2.2.) 


What all these approaches have in common, however, is that they are trained in a supervised 
manner, requiring datasets with appropriate ground truth. Most of them are using, among 
others, the DTU MVS benchmark (Jensen et al. 2014), which also serves as evaluation dataset 
for FaSS-MVS. The availability and versatility of appropriate datasets, however, is not very 
high, especially with respect to real-world scenarios, which still greatly hinders the practical 
use of deep-learning-based MVS approaches. To overcome this problem, recent approaches, 
such as the ones presented in (Khot et al. 2019, Huang et al. 2021), try to train models in an
unsupervised manner, sometimes also denoted as self-supervised. But again, their practical
use and ability to generalize still need further study (Khot et al. 2019). These limitations
are the reasons why learning-based approaches for the task of MVS are not yet practical for 
the considered use-case, namely to reliably assist emergency forces in the incremental and 
online mapping of the operational area. (Ruf et al. 2021b, Sec. 1.2.2.)


In summary, FaSS-MVS uses a plane-sweep algorithm similar to the one presented by Pollefeys
et al. (2008) to perform efficient dense multi-image matching and employs an improved
implementation of the SGM algorithm to extract the depth map from the results of the DIM. 
The use of a plane-sweep algorithm for the task of DIM is mainly motivated by its ability 
to create depth hypotheses by matching an arbitrary number of input images as well as the 
fact that it can efficiently be optimized for the massively parallel execution on GPUs, making
it particularly suitable for online processing. In the improved implementation of the SGM
algorithm, this work, among others, adopts the approach presented by Scharstein et al. (2017)
to account for non-fronto-parallel surfaces by adjusting the zero-cost transition based on
surface information stored inside a normal map. The approach of Roth and Mayer (2019)
appears to be very similar to FaSS-MVS. They also rely on the improvements proposed by
Scharstein et al. (2017) and combine the SGM with a plane-sweep DIM. However, their work 
focuses on the estimation of disparity images from ground-based stereo image pairs and was 
only evaluated on synthetic scenes so far. Moreover, this work also proposes to reduce the 
fronto-parallel bias of the SGM algorithm, by adjusting the zero-cost transition in the path 
aggregation based on the gradient of the minimum cost path. The complete depth estima- 
tion pipeline is embedded in a hierarchical processing scheme, just as proposed by Haala et 
al. (2015), in order to reduce the computational complexity induced by a large scene depth 
inherent to oblique aerial imagery. (Ruf et al. 2021b, Sec. 1.2.2.) 


4.3 FaSS-MVS - Fast Multi-View Stereo with 
Surface-Aware Semi-Global Matching 


Section 4.3.1 first gives an overview of the processing pipeline for fast multi-view stereo using
plane-sweep multi-image matching and surface-aware SGM. A detailed description of the dense
multi-image matching by means of plane-sweep sampling is given in Section 4.3.2, followed by
an in-depth description of the surface-aware extension of the SGM algorithm in Section 4.3.3.
In addition to the depth map, FaSS-MVS also computes a normal map and a confidence map,
which are described in Section 4.3.4 and Section 4.3.5, respectively. Lastly, Section 4.3.6
describes the post-processing steps employed to filter further outliers.


4.3.1 Overview on the full Processing Pipeline 


An overview of the processing pipeline of the presented approach for fast MVS, based on
plane-sweep sampling and a surface-aware SGM optimization (FaSS-MVS), is illustrated in
Figure 4.1. Given an input bundle $(\mathcal{I}, P)_k \in \Omega$, consisting of $k$ input images
$\mathcal{I}$, extracted from an image sequence in sequential order, as well as corresponding camera
poses $P$, the approach computes depth, normal and confidence maps
$(\mathcal{D}, \mathcal{N}, \mathcal{C})$ for a defined reference image $\mathcal{I}_{ref}$, which is
typically the middle one of the input bundle $\Omega$. In this, it is assumed that the input has
been calibrated, i.e. that the images are free of lens distortion and that the full projection
matrix $P_k = K\,[R_k \;\; -R_k C_k]$ of each image is known. Here, $C_k \in \mathbb{R}^3$ denotes
the locations of the camera centers with respect to a reference coordinate system $O_{ref}$, while
the row vectors of $R_k \in SO(3)$ hold the normalized coordinate axes of the camera coordinate
system $O_{cam}$, as seen from $O_{ref}$. The intrinsic calibration matrix $K$ is equal for all cameras,
since it is assumed that the images are captured from a single camera. As is typical for an MVS
approach, it is assumed that the input images depict the scene which is to be reconstructed
from slightly different viewpoints. (Ruf et al. 2021b, Sec. 2.1.)

Figure 4.1: Overview of the processing pipeline for Fast Multi-View Stereo with Surface-Aware Semi-Global Matching (FaSS-MVS). Given a bundle of images and
corresponding camera poses $(\mathcal{I}, P)_k$ of an input sequence, a hierarchical MVS estimation is performed to recover a depth, normal and confidence map
$(\mathcal{D}, \mathcal{N}, \mathcal{C})$. (Ruf et al. 2021b, Fig. 1.)


Before any processing, a Gaussian image pyramid with $n$ pyramid levels is computed for each
image of the input bundle, allowing a subsequent hierarchical processing. The lowest pyramid
level ($l = 0$) holds the input images with their original image size. While moving up the
pyramid, the images are first blurred with a Gaussian filter of size $3 \times 3$ pixels and with
$\sigma = 1$, before being scaled down by a factor of 0.5 in both image directions. Furthermore,
the intrinsic calibration matrices are also adjusted by halving the focal length and the coordinates
of the principal point in order to account for the reduced image size. This results in an
augmentation of the input bundle $\Omega$ by $n - 1$ additional sets. In the following, a superscript
$l$ is used to mark the results and processes at a specific pyramid level. The pipeline is initialized
at the coarsest pyramid level $l = (n - 1)$ with the smallest image size, executing three
subsequent computational parts at each pyramid level, thus resulting in a coarse-to-fine
processing. (Ruf et al. 2021b, Sec. 2.1.)
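The pyramid construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, images are single-channel NumPy arrays, borders are handled by edge replication, and the downscaling is done by keeping every second pixel.

```python
import numpy as np

def gaussian_kernel_3x3(sigma=1.0):
    """Normalized 3x3 Gaussian kernel."""
    ax = np.array([-1.0, 0.0, 1.0])
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur_3x3(img, sigma=1.0):
    """Blur with a 3x3 Gaussian; edge replication is an assumption."""
    k = gaussian_kernel_3x3(sigma)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def build_pyramid(img, K, n_levels):
    """Return [(image, K)] for levels l = 0 .. n_levels - 1 (fine to coarse)."""
    levels = [(img.astype(np.float64), K.astype(np.float64))]
    for _ in range(1, n_levels):
        prev_img, prev_K = levels[-1]
        # Blur, then scale down by 0.5 (here: keep every second pixel).
        small = blur_3x3(prev_img)[::2, ::2]
        K_small = prev_K.copy()
        # Halve focal lengths and principal point (first two rows of K).
        K_small[:2, :] *= 0.5
        levels.append((small, K_small))
    return levels
```

The coarse-to-fine processing would then iterate over the returned list in reverse order, starting at the last (coarsest) entry.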


The first part of the actual processing, namely the depth estimation, computes a depth map
$\mathcal{D}^l$ and is, in turn, subdivided into a plane-sweep multi-image matching, which creates
depth hypotheses, and the SGM* optimization, which extracts the optimal depth from the set of
hypotheses. The latter adapts the SGM approach, which was first presented by Hirschmüller (2005)
and Hirschmüller (2008), to the plane-sweep matching and extends it to account for
non-fronto-parallel surface structures. Section 4.3.2 gives a detailed description of the employed
plane-sweep algorithm, while the extension of the SGM algorithm is discussed in Section 4.3.3. A
concluding depth refinement and a median filter with a kernel size of $5 \times 5$ pixels are used to
filter small outliers in the resulting depth map. (Ruf et al. 2021b, Sec. 2.1.)


The second part of the hierarchical processing estimates a normal map $\mathcal{N}^l$ from the
previously computed depth map $\mathcal{D}^l$. Apart from being an additional output of FaSS-MVS,
the normal map is also used as part of the SGM* optimization to account for the surface
orientation in the next hierarchical iteration. The normal map is regularized by an
appearance-based weighted Gaussian smoothing in order to smooth out small variations while
preserving discontinuities. Section 4.3.4 describes the details on how the normal map is
extracted from a single depth map. (Ruf et al. 2021b, Sec. 2.1.)


In the third part, a confidence map $\mathcal{C}^l$ is computed, holding pixel-wise confidence scores in
the interval of $[0, 1]$ with respect to the depth estimates in $\mathcal{D}^l$. The final confidence
scores are computed based on the surface orientation at the considered pixel. Details on the
computation of $\mathcal{C}^l$ are given in Section 4.3.5. (Ruf et al. 2021b, Sec. 2.1.)


Inherent to a hierarchical coarse-to-fine processing, the depth map $\mathcal{D}^l$ and normal map
$\mathcal{N}^l$ computed at level $l$ are used to initialize the depth map estimation at the next
pyramid level $l - 1$, as long as the lowest levels of the image pyramids are not yet reached,
i.e. while $l > 0$. In this, $\mathcal{D}^l$ and $\mathcal{N}^l$ are upscaled with nearest neighbor
interpolation to the image size of the next pyramid level, yielding $\hat{\mathcal{D}}^l$ and
$\hat{\mathcal{N}}^l$. Then, $\hat{\mathcal{D}}^l$ is first used to compute the pixel-wise sampling
range $\Gamma_p^{l-1}$ of the multi-image plane-sweep algorithm at the next pyramid level.
Here, $\Gamma_p^{l-1}$ is computed for each pixel $p$ separately, based on the previous depth estimate
$\hat{d}_p^l = \hat{\mathcal{D}}^l(p)$ and a predefined window with a radius of $\Delta d$ around
$\hat{d}_p^l$:

$$
\Gamma_p^{l-1} = \left[ d_{p,\min}^{l-1},\; d_{p,\max}^{l-1} \right]
\quad \text{with} \quad
d_{p,\min}^{l-1} = \hat{d}_p^l - \Delta d, \qquad
d_{p,\max}^{l-1} = \hat{d}_p^l + \Delta d.
\tag{4.1}
$$

In the first iteration, the sampling range is set equally for all pixels and is parameterized by
minimum and maximum scene depth: $\Gamma_p^{n-1} = [d_{\min}, d_{\max}]$. The upscaled normal map
$\hat{\mathcal{N}}^l$ is used by one of the proposed SGM extensions to account for the surface
orientation within the scene. The final depth, normal and confidence maps are the outcome of
the processing at the lowest pyramid level. They are denoted as $\mathcal{D}$, $\mathcal{N}$ and
$\mathcal{C}$, respectively, and have the same image size as the input images. (Ruf et al. 2021b, Sec. 2.1.)
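To make the initialization of the next pyramid level concrete, the following sketch upscales a coarse depth map with nearest neighbor interpolation and derives the pixel-wise sampling range of Equation (4.1). The function names and the fixed upscaling factor of 2 are assumptions made for illustration only.

```python
import numpy as np

def upscale_nearest(arr, factor=2):
    """Nearest neighbor upscaling by an integer factor along both image axes."""
    return np.repeat(np.repeat(arr, factor, axis=0), factor, axis=1)

def pixelwise_sampling_range(depth_coarse, delta_d):
    """Return the per-pixel lower and upper depth bounds for the next finer level.

    depth_coarse: depth map of the current (coarser) level.
    delta_d:      radius of the window placed around the previous depth estimate.
    """
    d_hat = upscale_nearest(depth_coarse)
    return d_hat - delta_d, d_hat + delta_d
```

For example, a coarse estimate of 20 m with a window radius of 2 m restricts the plane sweep for the four corresponding finer pixels to the range from 18 m to 22 m.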


In a final post-processing step, a DOG filter (cf. Section 2.3.2.2) is used to mask out image
regions that lack the distinctive texture information needed for a reliable matching, and a
geometric consistency check (cf. Section 2.3.3) is applied. (Ruf et al. 2021b,
Sec. 2.1.)


4.3.2 Real-Time Dense Multi-Image Matching with Plane-Sweep 
Sampling 


Dense image matching refers to the process of finding a dense correspondence field between 
pixels of two or more images from which depth hypotheses are extracted (cf. Section 2.1). 
If the relative camera poses of input images are known, these correspondences can further 
be transformed into a depth map, which encodes the pixel-wise scene depth from the van- 
tage point of the reference camera for which the correspondence field and depth map are 
computed. As described in Section 2.1.3, Collins (1996) proposed the so-called plane-sweep 
algorithm for true multi-image matching to compute depth hypotheses. FaSS-MVS adopts
the plane-sweep algorithm for real-time multi-image matching and embeds it into a hierarchical
coarse-to-fine processing scheme in order to efficiently handle large scene depths
(Section 4.3.2.1). Moreover, FaSS-MVS utilizes the cross-ratio, which is invariant under per- 
spective projection, to efficiently compute the distance of the sampling planes with respect 
to the reference camera for an effective sampling of the scene-space (Section 4.3.2.3). (Ruf 
et al. 2021b, Sec. 2.2.) 


4.3.2.1 The Hierarchical Plane-Sweep Algorithm for Real-Time Multi-Image 
Matching 


The presented approach for hierarchical multi-image matching is based on the plane-sweep
algorithm presented by Pollefeys et al. (2008) and is described in Algorithm 4.1. As part
of the actual image matching, the Hamming distance of the census transform (CT) (Zabih
and Woodfill 1994) (cf. Section 2.1.1) as well as a negated, truncated and scaled form of the
normalized cross correlation (NCC) (Scharstein et al. 2017, Sinha et al. 2014) (cf. Section 2.1.2)
are employed and evaluated as cost function $C(\cdot)$. Since FaSS-MVS considers a bundle of input
images with an equal number of matching images to either side of the reference image, the
approach presented by Kang et al. (2001) is adopted in order to account for occlusions, using
the minimum aggregated matching cost of the left and right subset of the matching images.
The resulting three-dimensional cost volume $\mathcal{S}^l$ is of size $w^l \times h^l \times |\delta^l|$,
with $w^l$ and $h^l$ being the width and height of the reference image and $|\delta^l|$ being the
number of plane positions at which the matching is performed, all with respect to the current
pyramid level $l$. In this, the cost volume $\mathcal{S}^l$ is implemented as a dynamic cost volume
(Haala et al. 2015) for all but the topmost pyramid level, since the sampling range $\Gamma_p^l$
is determined independently for each pixel $p$. Nonetheless, the complete set of plane distances
$\delta \in [\delta_{\min}, \delta_{\max}]$, deduced from $\Gamma_p^l$, is precomputed for each pyramid
level $l$ and is the same for all pixels. This, in turn, allows the homographic mappings to be
precomputed for all planes $\Pi$. The following sections describe how the bounding planes
$\Pi_{\min}$ and $\Pi_{\max}$ and the corresponding distances $\delta_{\min}$ and $\delta_{\max}$ are
found (Section 4.3.2.2), as well as how the cross-ratio is used to find appropriate sampling
steps within $\Gamma_p^l$ (Section 4.3.2.3). (Ruf et al. 2021b, Sec. 2.2.1.)
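As an illustration of the census-based matching cost and the occlusion handling in the spirit of Kang et al. (2001), the sketch below computes a 3x3 census signature, its Hamming distance, and the minimum of the left and right aggregated costs. It assumes that the matching images have already been warped into the reference view for one plane hypothesis, that aggregation means summation over the subset, and that a 3x3 census window is used; none of these choices is taken from the source, and all function names are hypothetical.

```python
import numpy as np

# Lookup table with the number of set bits for every 8-bit value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def census_3x3(img):
    """8-bit census signature per pixel (3x3 window, edge replication)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    sig = np.zeros((h, w), dtype=np.uint8)
    bit = 0
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue  # skip the center pixel
            neighbor = padded[dy:dy + h, dx:dx + w]
            # Set the bit where the neighbor is darker than the center pixel.
            sig |= np.where(neighbor < img, np.uint8(1 << bit), np.uint8(0))
            bit += 1
    return sig

def hamming_cost(sig_a, sig_b):
    """Pixel-wise Hamming distance between two census signature images."""
    return POPCOUNT[sig_a ^ sig_b].astype(np.float64)

def occlusion_aware_cost(ref_sig, left_sigs, right_sigs):
    """Minimum of the summed costs of the left and the right image subset."""
    s_left = sum(hamming_cost(ref_sig, s) for s in left_sigs)
    s_right = sum(hamming_cost(ref_sig, s) for s in right_sigs)
    return np.minimum(s_left, s_right)
```

Calling `occlusion_aware_cost` once per plane hypothesis yields one slice of the cost volume; a pixel that is occluded in all images on one side can still receive a low cost from the other side.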


Algorithm 4.1: Plane-sweep multi-image matching executed at a specific pyramid level
$l$ of the proposed hierarchical processing scheme. (Ruf et al. 2021b, Alg. 1.)

Input Data: a calibrated image bundle $\Omega^l$ at the pyramid level $l$, a set of planes $\Pi$ with
a normal vector $N_\Pi$ and varying distances $\delta$, and a local depth sampling
range $\Gamma_p^l = [d_{p,\min}^l, d_{p,\max}^l]$.

Result: three-dimensional cost volume $\mathcal{S}^l$, holding the pixel-wise matching score for
each pixel $p \in \mathcal{I}_{ref}^l$ and plane $\Pi$.

- determine bounding planes $\Pi_{\min}$ and $\Pi_{\max}$ located at $\delta_{\min}$ and $\delta_{\max}$ so that the local
depth range $\Gamma_p^l$ is fully sampled (cf. Section 4.3.2.2).

foreach pixel $p \in \mathcal{I}_{ref}^l$ and distance $\delta \in [\delta_{\min}, \delta_{\max}]$ do

- configure scene plane $\Pi = (N_\Pi, \delta)$.

- determine pixels $p_k$ in all matching images $\mathcal{I}_k^l \in \Omega^l \setminus \mathcal{I}_{ref}^l$, according to the
homography $H$ (cf. Equation (2.7)):

$p_k = H(\Pi, P_{ref}^l, P_k^l) \cdot p$.

- warp local image patches $\mathcal{P}_k^l \in \mathcal{I}_k^l$ around $p_k$, with the same size as the support
region of the matching cost function $C(\cdot)$, into $\mathcal{I}_{ref}^l$:

$\hat{\mathcal{P}}_k^l = H(\Pi, P_{ref}^l, P_k^l)^{-1} \cdot \mathcal{P}_k^l$.

- compute the matching cost $s(p, \Pi)$ between the reference patch $\mathcal{P}_{ref}^l \in \mathcal{I}_{ref}^l$ and $\hat{\mathcal{P}}_k^l$
for left and right subset of cameras separately, yielding $s^L(p, \Pi)$ and $s^R(p, \Pi)$.

- store the minimum of left and right matching cost (accounting for occlusions as
described by Kang et al. (2001)) into the three-dimensional cost volume $\mathcal{S}^l$:

$\mathcal{S}^l(p, \Pi) = \min\{s^L(p, \Pi), s^R(p, \Pi)\}$.

end


4.3.2.2 Determining the Bounding Planes Corresponding to the Given Depth
Range


As previously described, it is assumed that two bounding planes, namely $\Pi_{\min}$ and $\Pi_{\max}$
with corresponding distances $\delta_{\min}$ and $\delta_{\max}$, between which the scene is to be sampled, are
known. In case of a fronto-parallel sampling strategy, i.e. $N_\Pi = (0\;\, 0\;\, {-1})^T$ with respect to
the local camera coordinate system, the distances $\delta_{\min}$ and $\delta_{\max}$ are equal to the minimum
and maximum depth, namely $d_{\min}$ and $d_{\max}$. This, however, does not hold for
non-fronto-parallel plane orientations. To find the bounding planes for slanted plane orientations,
a view-frustum corresponding to the reference camera, for which the depth is to be estimated,
is first constructed. This view-frustum is represented by a pyramid that resembles the
field-of-view of the camera and that is truncated by two fronto-parallel near and far planes that
are located at $d_{\min}$ and $d_{\max}$. Given the four corner points of the view-frustum on the near
plane $M_i^{near}$ and the four on the far plane $M_i^{far}$, the minimum and maximum distance $\delta_{\min}$ and
$\delta_{\max}$ are found according to:

$$
\delta_{\min} = \min_i \left( \left| N_\Pi^T \cdot M_i^{near} \right| \right), \quad \text{and} \quad
\delta_{\max} = \max_i \left( \left| N_\Pi^T \cdot M_i^{far} \right| \right).
\tag{4.2}
$$

In order to avoid an orientation flip of the image data, all camera centers $C_k$ need to lie in front
of $\Pi_{\min}$ with respect to the sweeping direction, thus for all camera centers
$N_\Pi^T \cdot C_k + \delta_{\min} > 0$ must hold. (Ruf et al. 2021b, Sec. 2.2.2.)
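The computation of the bounding distances in Equation (4.2) can be sketched as follows. The sketch assumes a pinhole camera with intrinsic matrix `K` located at the origin of the local camera coordinate system and takes the image corners at pixel coordinates (0, 0) to (width, height); the function names are hypothetical.

```python
import numpy as np

def frustum_corners(K, width, height, depth):
    """Four view-frustum corners on the fronto-parallel plane at the given depth."""
    pix = np.array([[0, 0, 1], [width, 0, 1],
                    [0, height, 1], [width, height, 1]], dtype=np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T  # back-projected rays with z = 1
    return rays * depth

def bounding_plane_distances(K, width, height, d_min, d_max, normal):
    """Distances of the bounding planes for a given plane normal (Equation (4.2))."""
    n = np.asarray(normal, dtype=np.float64)
    n = n / np.linalg.norm(n)
    near = frustum_corners(K, width, height, d_min)
    far = frustum_corners(K, width, height, d_max)
    delta_min = np.min(np.abs(near @ n))
    delta_max = np.max(np.abs(far @ n))
    return delta_min, delta_max
```

For the fronto-parallel normal (0, 0, -1) the function returns the minimum and maximum depth unchanged, while a slanted normal widens or shifts the interval so that the whole frustum stays between the two bounding planes.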


4.3.2.3 Finding the Sampling Steps by Utilizing the Cross-Ratio 


As stated by Equation (2.7), the sampling planes $\Pi$ of the plane-sweep algorithm are
parameterized by two parameters, namely the normal vector $N_\Pi$, denoting the orientation and
sweeping direction of the plane, and the orthogonal distance $\delta$ from the optical center of the
reference camera $C_{ref}$. While $N_\Pi$ allows the warping of the images and, thus, the image
matching to be adjusted to the surface orientation within the scene, the second parameter $\delta$
determines the step-size with which the scene is sampled. In this, a straight-forward approach
would be to select the step-size in such a way that the scene is sampled with a desired resolution,
i.e. sweeping the planes at equidistant unit intervals through the scene-space. However, it
is not guaranteed that a thorough sampling of the scene with a small step-size results in a
higher accuracy. If the step-size is not chosen in accordance with the camera positions of the
input images and the baseline between the cameras, the matching results of two or more
consecutive plane positions might not reveal enough difference and, thus, introduce ambiguities
between multiple plane hypotheses. Furthermore, in terms of efficiency, it is important to
vary the sampling rate in scene-space with respect to the distance of the plane relative to the
reference camera, since the perspective projection requires an increasingly smaller step-size
as the plane moves closer to the camera. (Ruf et al. 2021b, Sec. 2.2.3.)


Thus, a common approach is to select the sampling positions of the planes according to 
the disparity change induced by two consecutive planes. In this, the pixel-wise motion be- 
tween the warped images of two consecutive planes should not exceed an absolute value of 
1 (Szeliski and Scharstein 2004, Pollefeys et al. 2008). To ensure that the maximum disparity 
change between the warped images of two consecutive planes is less than or equal to 1,
Pollefeys et al. (2008) evaluate the displacements occurring at the boundaries of the most distant
images for a set of planes with predefined parameters and only select those that fulfill the 
stated criterion for the actual image matching. Thus, for each plane within the predefined 
set, an additional test is performed, involving the warping of image points at the boundaries, 
in order to determine whether the plane is suitable or not. (Ruf et al. 2021b, Sec. 2.2.3.)


Figure 4.2: Illustration of the cross-ratio. The cross-ratio between four collinear points is the same on all three lines 
and thus invariant under the perspective projection. (Ruf et al. 2021b, Fig. A12.) 


In contrast, FaSS-MVS aims to directly derive the distances of the sampling planes from
correspondences in image space, implicitly leading to an inverse depth sampling in scene-space.
Given an image point $m^{ref} \in \mathcal{I}_{ref}$ in the reference image, an intuitive approach would be to
select multiple sampling points $m^k \in \mathcal{I}_k$ on the corresponding epipolar line $l^k$ in one of the
other cameras, and find the corresponding plane distances by triangulation between $m^{ref}$ and
$m^k$. The effort of triangulation, however, can be avoided by relying on the cross-ratio, which
is invariant under perspective projection. (Ruf et al. 2021b, Sec. 2.2.3.)


The Cross-Ratio as Invariant under Perspective Projection: 


Given four collinear points, the cross-ratio describes the relative distance between them.
While under perspective projection the relative distances between the points change, their
cross-ratio does not change and is thus invariant. As illustrated by Figure 4.2, four points
$M_1$, $M_2$, $M_3$ and $M_4$ that lie on a straight line $L_0$ are perspectively projected onto further
non-parallel lines, e.g. $L_1$ and $L_2$. If the distance $\Delta(M_i, M_j)$ between two points $M_i$ and
$M_j$ is known, the cross-ratio $Q(\cdot)$ is calculated according to:

$$
Q(M_1, M_2, M_3, M_4) = \frac{\Delta(M_1, M_3) \cdot \Delta(M_2, M_4)}{\Delta(M_1, M_4) \cdot \Delta(M_2, M_3)}.
\tag{4.3}
$$

It is the same on all three lines. Thus, if the positions of the points on one line are known,
their positions on the other lines can be deduced. Moreover, given the four rays $V_1$, $V_2$, $V_3$
and $V_4$, which are going through the center of projection $C$ and each of the four points, as
well as their pairwise enclosed angles $\alpha(V_i, V_j)$, the cross-ratio can be extended to:

$$
Q(M_1, M_2, M_3, M_4) = Q(V_1, V_2, V_3, V_4)
= \frac{\sin \alpha(V_1, V_3) \cdot \sin \alpha(V_2, V_4)}{\sin \alpha(V_1, V_4) \cdot \sin \alpha(V_2, V_3)}.
\tag{4.4}
$$

(Ruf et al. 2021b, Appendix A.)
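The invariance stated by Equation (4.3) and Equation (4.4) can be checked numerically with a small two-dimensional example. The point coordinates, the second line and the helper names below are chosen freely for illustration and are not part of the original work.

```python
import numpy as np

def cross_ratio_points(m1, m2, m3, m4):
    """Cross-ratio of four collinear points from pairwise distances (Equation (4.3))."""
    d = lambda a, b: np.linalg.norm(np.asarray(b, float) - np.asarray(a, float))
    return (d(m1, m3) * d(m2, m4)) / (d(m1, m4) * d(m2, m3))

def sin_angle(u, v):
    """Sine of the angle enclosed by two 2D direction vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return abs(u[0] * v[1] - u[1] * v[0]) / (np.linalg.norm(u) * np.linalg.norm(v))

def cross_ratio_rays(v1, v2, v3, v4):
    """Cross-ratio of four rays from their enclosed angles (Equation (4.4))."""
    return (sin_angle(v1, v3) * sin_angle(v2, v4)) / (sin_angle(v1, v4) * sin_angle(v2, v3))

def project_through_center(center, point, line_point, line_dir):
    """Intersect the ray from center through point with the line line_point + t * line_dir."""
    center, point = np.asarray(center, float), np.asarray(point, float)
    line_point, line_dir = np.asarray(line_point, float), np.asarray(line_dir, float)
    ray = point - center
    s, _ = np.linalg.solve(np.column_stack([ray, -line_dir]), line_point - center)
    return center + s * ray

# Four points on the line y = 4, projected through C = (0, 0) onto a second line.
C = (0.0, 0.0)
points_l0 = [(1.0, 4.0), (2.0, 4.0), (3.5, 4.0), (6.0, 4.0)]
points_l1 = [project_through_center(C, m, (0.0, 2.0), (1.0, 0.5)) for m in points_l0]

q_l0 = cross_ratio_points(*points_l0)
q_l1 = cross_ratio_points(*points_l1)
q_rays = cross_ratio_rays(*[np.asarray(m) - np.asarray(C) for m in points_l0])
```

All three values agree up to floating-point precision, although the pairwise distances on the two lines differ.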


Finding Plane Distances by Utilizing the Cross-Ratio: 


Given the projection matrices $P_{ref}$ and $P_k$ of two cameras, as illustrated in Figure 4.3, the
coordinate system is first centered in the optical center of the reference camera $C_{ref}$, so that
$P_{ref} = K\,[I \;\; 0]$. Thus, the plane distances determined by this approach are relative to
$C_{ref}$. In case of MVS with multiple cameras, the camera which will induce the largest image
offset and thus gives an upper bound on the disparity range is selected for $C_k$. As already noted
by Pollefeys et al. (2008), this is typically the camera which is most distant from the reference
camera. (Ruf et al. 2021b, Sec. 2.2.3.)


Figure 4.3: Illustration of determining the orthogonal distance parameter of the sampling planes of the plane-sweep
multi-image matching by using the cross-ratio and epipolar geometry. Here, $C_{ref}$ and $C_k$ represent the
positions of the optical centers of the two cameras. (Ruf et al. 2021b, Fig. 3.)


Furthermore, it is again assumed that two bounding planes $\Pi_{\min}$ and $\Pi_{\max}$, which limit the
sweep space of the planes, are given. Just as in the work of Pollefeys et al. (2008), it is ensured
that the sweep space does not intersect the convex hull of the cameras, in order to avoid
inversions in the process of image matching (cf. Section 4.3.2.2). The image point $m^{ref}$ is
chosen as the pixel which induces the largest disparity when warped from $\mathcal{I}_{ref}$ to $\mathcal{I}_k$ via
$H(\Pi_{\min}, P_{ref}, P_k)$, which is typically one of the four corners. This bounds the maximum
disparity change between successive planes and allows a set of sampling planes, with
distances $\delta$ relative to $C_{ref}$, to be found according to Algorithm 4.2, as illustrated by
Figure 4.3. (Ruf et al. 2021b, Sec. 2.2.3.)


Algorithm 4.2: Finding plane distances $\delta$ by utilizing the cross-ratio. (Ruf et al. 2021b,
Alg. 2.)

Input Data: two cameras with full projection matrices $P_{ref}$ and $P_k$, an image point $m^{ref}$
inducing largest disparity when warped from $\mathcal{I}_{ref}$ to $\mathcal{I}_k$, as well as two
bounding planes $\Pi_{\min}$ and $\Pi_{\max}$.

Result: list of orthogonal plane distances $\delta$ relative to $C_{ref}$, such that the maximum
pixel-displacement between the warped images of two consecutive planes is
less than or equal to 1.

- calculate the viewing ray $V^{ref}$, going through $C_{ref}$ and $m^{ref}$, and intersect it with $\Pi_{\min}$
and $\Pi_{\max}$, yielding the scene points $M_{\min}$ and $M_{\max}$.

- project the optical center $C_{ref}$ as well as $M_{\min}$ and $M_{\max}$ onto the image plane of the
second camera, yielding the epipole $e^k_{ref}$ and the two image points $m^k_{\min}$ and $m^k_{\max}$, all
lying on the epipolar line $l^k$.

- determine the unit vector $\vec{k} = \frac{m^k_{\min} - m^k_{\max}}{\lVert m^k_{\min} - m^k_{\max} \rVert}$, being the normalized direction of $l^k$,
pointing from $m^k_{\max}$ to $m^k_{\min}$.

for $m^k_i \leftarrow m^k_{\max}$ to $m^k_{\min}$ by $m^k_{i+1} = m^k_i + \vec{k}$ do

- given the rays $V^k_{e_{ref}}$, $V^k_{m_{\min}}$, $V^k_{m_i}$ and $V^k_{m_{\max}}$, going through the optical center
$C_k$ and $e^k_{ref}$, $m^k_{\min}$, $m^k_i$ and $m^k_{\max}$, respectively, apply Equation (4.3) and Equation (4.4)
to compute $M_i \in V^{ref}$ according to:

$$
Q(V^k_{e_{ref}}, V^k_{m_{\min}}, V^k_{m_i}, V^k_{m_{\max}})
= \frac{\sin \alpha(V^k_{e_{ref}}, V^k_{m_i}) \cdot \sin \alpha(V^k_{m_{\min}}, V^k_{m_{\max}})}{\sin \alpha(V^k_{e_{ref}}, V^k_{m_{\max}}) \cdot \sin \alpha(V^k_{m_{\min}}, V^k_{m_i})}
= \frac{\Delta(C_{ref}, M_i) \cdot \Delta(M_{\min}, M_{\max})}{\Delta(C_{ref}, M_{\max}) \cdot \Delta(M_{\min}, M_i)}
$$

- since $Q(C_{ref}, M_{\min}, M_i, M_{\max}) = Q(C_{ref}, \delta_{\min}, \delta, \delta_{\max})$, derive $\delta$ relative to $C_{ref}$
according to:

$$
\frac{\delta \cdot (\delta_{\max} - \delta_{\min})}{\delta_{\max} \cdot (\delta - \delta_{\min})}
= \frac{\sin \alpha(V^k_{e_{ref}}, V^k_{m_i}) \cdot \sin \alpha(V^k_{m_{\min}}, V^k_{m_{\max}})}{\sin \alpha(V^k_{e_{ref}}, V^k_{m_{\max}}) \cdot \sin \alpha(V^k_{m_{\min}}, V^k_{m_i})}
$$

end


This approach is computationally efficient and not restricted to a fronto-parallel orientation 
of the sampling planes as long as the optical axis of the reference camera intersects with the 
planes and the sweeping vector has a component that is parallel to the optical axis. Furthermore,
in order to accommodate all possible setups of $C_{ref}$ and $C_k$, it is important to use
$Q(V^k_{e_{ref}}, V^k_{m_{min}}, V^k_{m_i}, V^k_{m_{max}})$ in Algorithm 4.2, since $e^k_{ref}$ would flip to the side of $m^k_{max}$ if the focal
plane of the reference camera is behind $C_k$. (Ruf et al. 2021b, Sec. 2.2.3.)
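
The last step of Algorithm 4.2 can be illustrated with a minimal Python sketch. It assumes that the cross-ratio value q has already been obtained from the four viewing rays, and it only solves the relation δ_i (δ_max − δ_min) / (δ_max (δ_i − δ_min)) = q for δ_i, which gives δ_i = q δ_max δ_min / (q δ_max − (δ_max − δ_min)). The function names and the numeric values are hypothetical and chosen for illustration only.

```python
def cross_ratio_of_distances(d_min: float, d_i: float, d_max: float) -> float:
    """Cross-ratio Q(C_ref, d_min, d_i, d_max) with C_ref at distance 0."""
    return (d_i * (d_max - d_min)) / (d_max * (d_i - d_min))


def delta_from_cross_ratio(q: float, d_min: float, d_max: float) -> float:
    """Solve the cross-ratio relation for the plane distance d_i."""
    return (q * d_max * d_min) / (q * d_max - (d_max - d_min))


# Round trip with made-up plane distances (arbitrary units).
d_min, d_max = 2.0, 10.0
q = cross_ratio_of_distances(d_min, 4.0, d_max)
recovered = delta_from_cross_ratio(q, d_min, d_max)
```

A cross-ratio of 1 maps to the farthest plane, since the relation then reduces to δ_i = δ_max.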


4.3.3 Depth Map Computation with Surface-Aware Semi-Global 
Matching 


The hierarchical plane-sweep algorithm for multi-image matching, as described in Section 4.3.2,
produces at each pyramid level a three-dimensional cost volume $\mathcal{S}^l(p, \Pi)$, which
holds matching costs for each pixel $p \in \mathcal{I}_{ref}$ induced by a given plane $\Pi$ located at distance
$\delta$ orthogonal to the optical center $C_{ref}$ of the reference camera. In the second stage of the
depth estimation within FaSS-MVS, the cost volume is regularized by a semi-global optimization
scheme, yielding a dense depth map $D^l$. Building upon the original SGM approach
(Hirschmüller 2005, Hirschmüller 2008) (cf. Section 2.2.1), three different optimization
schemes ($\mathrm{SGM}^x$) are proposed in the scope of this work. Apart from a straightforward adaptation
of the SGM approach to the plane-sweep sampling, FaSS-MVS adopts the approach
of Scharstein et al. (2017) to also favor slanted surfaces by considering surface information
available in the form of surface normals. Furthermore, a third extension penalizes deviations
from the gradient of the minimum cost path within the SGM optimization scheme. (Ruf
et al. 2021b, Sec. 2.3.)


4.3.3.1 Extension of Semi-Global Matching to Plane-Sweep Matching and Different 
Surface Orientations 


The following section describes in detail how the proposed SGM algorithm is adapted with
respect to the creation of depth hypotheses with the help of plane-sweep multi-image matching,
as well as the extensions that are employed to account for non-fronto-parallel surfaces.
The proposed extensions $\mathrm{SGM}^x$ only affect the aggregation of the matching costs along the
concentric paths. Thus, the extraction of the depth map $D$ is done analogously to Equation (2.12)
and Equation (2.9) of the SGM algorithm, with the disparity being substituted by
depth. If a fronto-parallel plane orientation is considered during the plane-sweep, the depth
can be directly extracted from the plane parameterization. However, for non-fronto-parallel
orientations, the depth map $D$ is computed by a pixel-wise intersection of the viewing rays
with the corresponding WTA solutions (cf. Equation (2.8)). (Ruf et al. 2021b, Sec. 2.3.2.)

Figure 4.4: Illustration of the different path aggregation strategies along one path direction $r$ within the three
presented $\mathrm{SGM}^x$ optimization schemes. Column 1: Reference image and normal map of a building. Illustrated
area marked with yellow line. Column 2: $\mathrm{SGM}^{\Pi}$ path aggregation. The blue and pink lines represent the blue
and pink surface orientations on the building facade. Aggregating the path costs for pixel $p$ at plane $\Pi$,
$\mathrm{SGM}^{\Pi}$ will incorporate the previous costs at the same plane position (green) without additional penalty.
The previous path costs at $I(\Pi) + 1$ (yellow) will be penalized with $\varphi_1$. The previous path costs located at
$I(\Pi) + 2$ (red), which is actually located on the corresponding surface, will be penalized with the highest
penalty $\varphi_2$. Column 3: $\mathrm{SGM}^{\Pi\text{-sn}}$ uses the normal vector $N_p$, encoding the surface orientation at pixel $p$,
and computes a discrete index jump $\Delta i_{sn}$, which ideally adjusts the zero-cost transition, causing the previous
path costs at $I(\Pi) + 2$ to not be penalized. Column 4: Similar to $\mathrm{SGM}^{\Pi\text{-sn}}$, $\mathrm{SGM}^{\Pi\text{-pg}}$ adjusts the
zero-cost transition. However, the discrete index jump $\Delta i_{pg}$ is derived from the running gradient $\nabla r$ of the
minimum cost path. (Ruf et al. 2021b, Fig. 4.)


Resolving Plane Hypotheses with Semi-Global Matching: 


Since the plane-sweep algorithm does not compute hypotheses on disparities, but rather
pixel-wise plane distances relative to the reference camera and thus depth, the first SGM
extension proposed is a straightforward adaptation of the standard SGM algorithm to a
multi-view plane-sweep sampling. In this, the formulation of the SGM path aggregation (cf.
Equation (2.11)) is modified to

$$L_r(p, \Pi) = \mathcal{S}(p, \Pi) + \min_{\Pi'} \left( L_r(p - r, \Pi') + V_{\Pi}(\Pi, \Pi') \right), \tag{4.5}$$

where $\Pi$ denotes the sampling plane at distance $\delta$. The smoothness term $V_{\Pi}$ now penalizes the
selection of different planes between adjacent pixels along the path $L_r$, instead of disparities.
It is formulated as:

$$V_{\Pi}(\Pi, \Pi') = \begin{cases} 0, & \text{if } I(\Pi) = I(\Pi') \\ \varphi_1, & \text{if } |I(\Pi) - I(\Pi')| = 1 \\ \varphi_2, & \text{if } |I(\Pi) - I(\Pi')| > 1, \end{cases} \tag{4.6}$$

with $I(\cdot)$ being a function that returns the index of $\Pi$ within the set of sampling planes (cf.
Figure 4.4, Column 2). This extension is denoted as plane-wise SGM ($\mathrm{SGM}^{\Pi}$). Given a pixel-wise
WTA plane parameterization, the corresponding depth is extracted by intersecting the
viewing ray through pixel $p$ with the corresponding plane according to Equation (2.8). (Ruf
et al. 2021b, Sec. 2.3.2.)
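
The plane-wise aggregation can be sketched in a few lines of Python. The sketch assumes the smoothness term of Equation (4.6), i.e. no penalty for the same plane index, a first penalty for an index difference of 1 and a second penalty for larger differences. It processes a single path only, whereas SGM sums the costs of several paths, and it omits the normalization commonly applied in SGM implementations. The function names and the toy costs are hypothetical.

```python
def smoothness(idx: int, idx_prev: int, phi1: float, phi2: float) -> float:
    """Penalty based on the difference of the plane indices (Equation (4.6))."""
    diff = abs(idx - idx_prev)
    if diff == 0:
        return 0.0
    return phi1 if diff == 1 else phi2


def aggregate_path(costs: list, phi1: float, phi2: float) -> list:
    """Aggregate along one path; costs[x][i] is the matching cost of pixel x at plane i."""
    num_planes = len(costs[0])
    aggregated = [list(costs[0])]  # the first pixel has no predecessor
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        row = []
        for i in range(num_planes):
            best = min(prev[j] + smoothness(i, j, phi1, phi2) for j in range(num_planes))
            row.append(costs[x][i] + best)
        aggregated.append(row)
    return aggregated


# Toy cost slice: 3 pixels along the path, 3 sampling planes.
costs = [[0.0, 5.0, 5.0], [5.0, 0.0, 5.0], [5.0, 5.0, 0.0]]
path_costs = aggregate_path(costs, phi1=1.0, phi2=4.0)
wta = [row.index(min(row)) for row in path_costs]
```

In this toy slice the cheapest plane index increases by one per pixel, so each transition only incurs the first penalty.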


Incorporating Surface Normals to Adjust the Zero-Cost Transition: 


The smoothness term of the initial SGM algorithm is formulated with discrete disparity differences
(cf. Equation (2.11)), penalizing discrete disparity jumps between neighboring pixels.
In its optimization scheme, it does not consider any subpixel disparity and thus favors fronto-parallel
surface structures, leading to stair-casing artifacts if no post-processing is employed
(Scharstein et al. 2017). The same holds for the first extension, $\mathrm{SGM}^{\Pi}$. Even though the
plane-sweep sampling also supports non-fronto-parallel plane orientations, the smoothness
term of $\mathrm{SGM}^{\Pi}$ (cf. Equation (4.6)) does not, and strongly penalizes index jumps in the sampling
planes of more than 1. While this is desired if the plane orientation coincides with
the surface orientation, it will still lead to stair-casing artifacts if the surface and plane orientations
do not align. In order to overcome the favoring of fronto-parallel structures and
adjust the smoothness term of SGM to surfaces that are slanted with respect to the sampling
direction, Scharstein et al. (2017) suggest adding an offset to the smoothness term. This offset
can be extracted from additional information on the surface orientation, e.g. surface normals,
and makes the zero-cost transition coincide with the surface orientation. In the scope of
this work, this approach is adopted as part of the second extension. It is thus called surface
normal SGM ($\mathrm{SGM}^{\Pi\text{-sn}}$). (Ruf et al. 2021b, Sec. 2.3.2.)


In the presented hierarchical approach, the normal vectors from the normal map $N^{l+1}$, which
was estimated in the previous level of the pyramid (cf. Figure 4.1), are extracted. The pixel-wise
normal vectors $N_p = N^{l+1}(p)$ indicate the surface orientation at the scene point $M$,
which is computed by intersecting the viewing ray through $p$ with the plane $\Pi$. From this,
the discrete index jump $\Delta i_{sn}$ through the set of sampling planes, which is
caused by the tangent plane to $N_p$, can be calculated. Since the plane-sweep sampling is not restricted to fronto-parallel
plane orientations, the index jump $\Delta i_{sn}$ needs to be computed based on the difference
between the tangent plane at $M$ and the orientation of the sampling planes in the direction
$r$ of the currently considered aggregation path. With $\Delta i_{sn}$, the smoothness term used by the
extension $\mathrm{SGM}^{\Pi\text{-sn}}$ is adjusted according to

$$V_{\Pi\text{-sn}}(\Pi, \Pi') = \begin{cases} 0, & \text{if } I(\Pi) + \Delta i_{sn} = I(\Pi') \\ \varphi_1, & \text{if } |I(\Pi) + \Delta i_{sn} - I(\Pi')| = 1 \\ \varphi_2, & \text{if } |I(\Pi) + \Delta i_{sn} - I(\Pi')| > 1. \end{cases} \tag{4.7}$$

This allows aligning the zero-cost transition of the SGM path aggregation to the surface orientation
of the scene (cf. Figure 4.4, Column 3). The pixel-wise discrete index jumps can be
computed once for each pixel $p$ and each path direction $r$, as also noted by Scharstein et al.
(2017), incurring little computational overhead. (Ruf et al. 2021b, Sec. 2.3.2.)
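
The effect of the index jump on the smoothness term can be made concrete with a short Python sketch. It assumes that the jump has already been computed for the current pixel and path direction, and it only shows how the zero-cost transition is shifted. The function name, the surface, and the penalty values are hypothetical.

```python
def smoothness_with_jump(idx: int, idx_prev: int, jump: int,
                         phi1: float, phi2: float) -> float:
    """Smoothness penalty whose zero-cost transition is shifted by `jump` plane indices."""
    diff = abs(idx + jump - idx_prev)
    if diff == 0:
        return 0.0
    return phi1 if diff == 1 else phi2


# Hypothetical slanted surface that advances two sampling planes per pixel.
penalties = [smoothness_with_jump(3, prev, 2, 100.0, 108.0) for prev in (3, 4, 5, 6)]
```

With a jump of 2, the predecessor at index 5 is reached at zero cost, while staying on the same plane index is penalized with the second penalty. With a jump of 0, the function reduces to the plane-wise smoothness term.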


Penalizing Deviations from the Gradient of the Minimum Cost Path: 


Instead of relying on additional information, e.g. normal vectors, the third extension computes
the running gradient $\nabla r$ in scene-space, which corresponds to the minimal path costs,
in order to adjust the zero-cost transition in the aggregation of the path costs for non-fronto-parallel
surface orientations. Hence, it is denoted as path gradient SGM ($\mathrm{SGM}^{\Pi\text{-pg}}$). (Ruf et al.
2021b, Sec. 2.3.2.)


The gradient vector $\nabla r = M - M'$ in scene-space is dynamically computed while traversing
along the path $r$. In this, $M$ again denotes the scene point that is found by intersecting the
viewing ray through $p$ with $\Pi$, while $M'$ denotes the scene point that is parameterized by
$p'$ and the plane $\Pi'$. Here, $p' = p - r$ represents the predecessor of $p$ along the path $r$, and $\Pi'$
denotes the plane at distance $\delta = \operatorname{argmin}_{\delta} L_r(p', \Pi)$ associated with the previous minimal
path costs. (Ruf et al. 2021b, Sec. 2.3.2.)


From this, a discrete index jump $\Delta i_{pg}$ is computed, which is again used to account for possibly
slanted surfaces in scene-space by adjusting the zero-cost transition of the smoothness term
according to

$$V_{\Pi\text{-pg}}(\Pi, \Pi') = \begin{cases} 0, & \text{if } I(\Pi) + \Delta i_{pg} = I(\Pi') \\ \varphi_1, & \text{if } |I(\Pi) + \Delta i_{pg} - I(\Pi')| = 1 \\ \varphi_2, & \text{if } |I(\Pi) + \Delta i_{pg} - I(\Pi')| > 1. \end{cases} \tag{4.8}$$

This implicitly penalizes deviations from the running gradient between two scene points corresponding
to two consecutive pixels on the aggregation path $r$ (cf. Figure 4.4, Column 4).
(Ruf et al. 2021b, Sec. 2.3.2.)


4.3.3.2 Adaptive Smoothness Penalties Within Semi-Global Matching 


In the original publication (Hirschmüller 2005, Hirschmüller 2008) of the SGM algorithm, the
author suggests using an adaptive adjustment of the second penalty $\varphi_2$ according to the image
gradient along path $r$. This should enforce a preservation of depth discontinuities at object
boundaries. In this work, the adaptive adjustment of $\varphi_2$ is based on the absolute intensity
difference ($\Delta\mathcal{I}_{pq}$) between two neighboring pixels $p$ and $q$. It is formulated by

$$\varphi_2 = \varphi_1 + \alpha \cdot \exp\left( -\frac{\Delta\mathcal{I}_{pq}}{\beta} \right), \tag{4.9}$$

with $\Delta\mathcal{I}_{pq} = |\mathcal{I}(p) - \mathcal{I}(q)|$. Moreover, $\alpha = 8$ and $\beta = 10$ are set according to Scharstein et al.
(2017). Since FaSS-MVS uses multiple matching images, the SGM penalties are multiplied
with the number of input images inside the left and right subsets with respect to $\mathcal{I}_{ref}$, because
the matching costs are summed up within these image sets. (Ruf et al. 2021b, Sec. 2.3.3.)
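
The adaptive penalty can be sketched as follows. The sketch assumes the additive form φ_2 = φ_1 + α · exp(−ΔI_pq / β) with α = 8 and β = 10, and it leaves out the scaling by the number of input images. The function name and the intensity values are hypothetical.

```python
import math


def adaptive_phi2(phi1: float, intensity_p: float, intensity_q: float,
                  alpha: float = 8.0, beta: float = 10.0) -> float:
    """Second SGM penalty, lowered where the intensity difference is large."""
    delta_i = abs(intensity_p - intensity_q)
    return phi1 + alpha * math.exp(-delta_i / beta)


flat_region = adaptive_phi2(100.0, 120.0, 120.0)  # identical intensities
strong_edge = adaptive_phi2(100.0, 120.0, 20.0)   # intensity difference of 100
```

Under this assumption, the second penalty is largest in homogeneous regions and approaches the first penalty at strong intensity edges, where depth discontinuities are more likely.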


4.3.3.3 Depth Refinement 


In a final optimization step in the process of depth map estimation, the pixel-wise estimates
are refined to lie in between the actual sampling steps of the plane-sweep sampling. Here,
the simple and yet effective approach of fitting a parabola through the WTA solution and its
two neighbors is employed. This is done analogously to implementing a disparity refinement
based on curve fitting for the stereo normal case (cf. Section 2.3.1). However, since the sampling
points are not equidistantly spaced, the depth differences between the
WTA solution and its two neighbors need to be considered in the optimization and in finding
the curve's minimum. (Ruf et al. 2021b, Sec. 2.3.4.)
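
A parabola fit through three non-equidistant samples can be sketched with divided differences. The sketch below is an illustration of the general technique and not the implementation used in FaSS-MVS. The function name and the sample values are hypothetical.

```python
def refine_depth(d_prev: float, c_prev: float, d_wta: float, c_wta: float,
                 d_next: float, c_next: float) -> float:
    """Depth at the vertex of the parabola through three (depth, cost) samples."""
    slope_left = (c_wta - c_prev) / (d_wta - d_prev)
    slope_right = (c_next - c_wta) / (d_next - d_wta)
    a = (slope_right - slope_left) / (d_next - d_prev)  # quadratic coefficient
    if a <= 0.0:
        return d_wta  # no proper minimum, keep the sampled depth
    b = slope_left - a * (d_prev + d_wta)  # linear coefficient
    return -b / (2.0 * a)


# Samples of the cost curve (d - 3)^2 at non-equidistant depths.
refined = refine_depth(1.0, 4.0, 2.5, 0.25, 5.0, 4.0)
```

For the sampled cost curve (d − 3)², the refined depth is 3, although the closest sample lies at 2.5.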


4.3.4 Extraction of Surface Normals from Depth Maps 


From the estimated depth map $D$, a normal map $N$ is computed, holding the local surface
orientations in the form of three-dimensional normal vectors. The surface normal vectors $N_p$
are calculated for each pixel $p$ by reprojecting $D$ into a three-dimensional point cloud, based
on the intrinsic camera parameters and the corresponding depth estimate. Then, the cross-product
$N_p = H_p \times V_p$ is computed, with $H_p$ being the difference vector between the scene
points of two neighboring pixels to $p$ in horizontal direction, and $V_p$ being the difference
vector in vertical direction. (Ruf et al. 2021b, Sec. 2.4.)
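
The cross-product step can be sketched in plain Python. The sketch assumes a pinhole camera with focal length f and principal point (cx, cy), a depth map stored as a list of rows, and the left/right and upper/lower neighbors as the two neighboring pixels. The sign of the resulting normal depends on the chosen coordinate convention. All names are hypothetical.

```python
import math


def backproject(u, v, depth, f, cx, cy):
    """Pinhole back-projection of pixel (u, v) with the given depth."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normal_at(depth_map, u, v, f, cx, cy):
    """Unit normal from the horizontal and vertical difference vectors at (u, v)."""
    def point(uu, vv):
        return backproject(uu, vv, depth_map[vv][uu], f, cx, cy)
    h = sub(point(u + 1, v), point(u - 1, v))  # horizontal difference vector
    w = sub(point(u, v + 1), point(u, v - 1))  # vertical difference vector
    n = cross(h, w)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


flat = [[2.0] * 3 for _ in range(3)]                   # fronto-parallel plane
slanted = [[1.0, 2.0, 3.0] for _ in range(3)]          # depth grows to the right
normal_flat = normal_at(flat, 1, 1, f=100.0, cx=1.0, cy=1.0)
normal_slanted = normal_at(slanted, 1, 1, f=100.0, cx=1.0, cy=1.0)
```

For the fronto-parallel plane the normal is parallel to the optical axis, while the slanted plane yields a normal that is tilted in the horizontal direction.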


Solely using the cross-product to compute the surface orientation does not incorporate any local
smoothness assumption, which is why an a-posteriori smoothing is applied to the normal
map. In this, an appearance-based weighted Gaussian smoothing in a local two-dimensional
window $W_p$ around $p$ is employed, which adjusts the smoothing strength depending on the
intensity difference between $q \in W_p$ and $p$:

$$N(p) = \frac{\tilde{N}_p}{|\tilde{N}_p|} \tag{4.10}$$

with

$$\tilde{N}_p = N_p + \sum_{q \in W_p} N_q \cdot \left( \frac{1}{2\pi\sigma^2} \cdot \exp\left( -\frac{(q - p)^2}{2\sigma^2} - \frac{\Delta\mathcal{I}_{pq}}{\beta} \right) \right). \tag{4.11}$$

As in Equation (4.9), $\beta$ is set to 10, while $\sigma$ is fixed to the radius of $W_p$. (Ruf et al.
2021b, Sec. 2.4.)


4.3.5 Estimation of Confidence Measures Based on Surface 
Orientation 


Apart from the depth map $D$ and the normal map $N$, FaSS-MVS also computes confidence
measures with respect to the depth estimates in the range of [0, 1] and stores them inside a
confidence map $C$. Such confidence measures allow a subsequent reasoning on the certainty
of the corresponding estimates and, in turn, improve further processing. Thus, confidence
maps are helpful byproducts for subsequent steps, such as depth map fusion or scene interpretation.
Furthermore, they provide more insight into the effects of different configurations
of the presented approach, based on receiver operating characteristic (ROC) curve
analysis (cf. Figure 4.8). The computation of the pixel-wise confidence measures in the scope
of this work relies on the geometric characteristic of the estimated depth map and is deduced
from the normal vectors stored inside the normal map $N$ and the plane orientations of the
plane-sweep sampling. (Ruf et al. 2021b, Sec. 2.5.)


In particular, the geometric confidence measure is based on the enclosed angles between the
local surface orientation stored inside the normal map $N_p = N(p)$, the orientation of the
sampling plane $N_\Pi$ and the reverted viewing direction $V$. This is adopted from the geometric
weighting factor proposed by Kolev et al. (2014). They argue that a depth estimate is more
accurate if the surface orientation of the observed geometry is fronto-parallel to the image
plane of the camera, and less accurate if the camera is observing slanted surfaces. This correlation
is modeled by the scalar product between the surface orientation and the reverted
viewing direction. Furthermore, since the image warping, as part of the image matching, can
be aligned to the surface orientation by adjusting the normal vector of the plane-sweep algorithm,
the plane orientation is also considered. Thus, the geometry-based weighting factor
is computed according to:

$$C(p) = \begin{cases} \dfrac{\langle N_p, N_\Pi \rangle \cdot \langle N_\Pi, V \rangle - \cos\rho}{1 - \cos\rho}, & \text{if } \{\angle(N_p, N_\Pi) \wedge \angle(N_\Pi, V)\} < \rho \\ 0, & \text{otherwise.} \end{cases} \tag{4.12}$$

All of the above vectors are assumed to be normalized and given with respect to the local
coordinate system of the camera, thus $V = (0\ 0\ {-1})^T$. Just as in the work of Kolev et al. (2014), a
critical angle $\rho = 60°$ is used to mark the measurements for which the enclosed angles exceed
this threshold as unreliable. The additional consideration of $N_\Pi$ in Equation (4.12) implicitly
models the indirect matching of the input images via the plane-induced homography. (Ruf
et al. 2021b, Sec. 2.5.)


4.3.6 Post-Processing and Depth Map Filtering 


In a final post-processing step, remaining outliers are removed from the depth, normal and 
confidence maps by applying a difference-of-Gaussian (DOG) filtering and a geometric con- 
sistency check, as described in Section 2.3.2.2 and Section 2.3.3.2, respectively. Here, for all 
pixels that are invalidated in the depth map, the corresponding pixels inside the normal and 
confidence maps are also invalidated. (Ruf et al. 2021b, Sec. 2.6.) 


In the scope of the DOG filter (cf. Algorithm 2.2), the kernel sizes of the Gaussian filter kernel
and the dilation are set to 7 x 7 pixels and 3 x 3 pixels, respectively. Furthermore, the respective
speckle filter sizes are set to @ = 7 and 6 = 21. For the multi-view geometric consistency 
check (cf. Algorithm 2.3), the size of the sliding window is set to |¥| = 5, the threshold for 
the reprojection error to 7, = 10 and the consistency threshold to 7, = 3. 


4.4 Experiments


The following sections present the results of experiments conducted in the scope of this work.
In these, the effects of different configurations of the presented approach are evaluated and 
analyzed with respect to individual aspects, such as accuracy, efficiency and application- 
specific usability. First, the datasets and evaluation metrics used to investigate the potential 
of FaSS-MVS are introduced in Section 4.4.1 and Section 4.4.2, respectively. Then, the need 
for a hierarchical processing is evaluated and the optimal number of input images, i.e. the 
optimal size of the input bundle, is empirically found in Section 4.4.3. This is followed by a 
short comparison in Section 4.4.4, where the focus is set on the two similarity-metrics and 
cost functions that are implemented in the scope of this work. In Section 4.4.5, the ability of 
the three SGM extensions to reconstruct non-fronto-parallel surface structures is evaluated 
and compared to the effects of using a non-fronto-parallel plane orientation within plane- 
sweep sampling. An evaluation of the improvements gained by post-filtering is presented in 
Section 4.4.6, before comparing the results of the best configurations to those achieved by 
approaches for offline MVS, e.g. COLMAP (Schönberger et al. 2016), in Section 4.4.7. Finally, 
before discussing the presented results, the outcomes of use-case-specific experiments are 
shown and qualitatively illustrated in Section 4.4.8. (Ruf et al. 2021b, Sec. 3.) 


The complete processing pipeline of the presented approach, except the generation of the 
Gaussian image pyramids and the parameterization of the plane-sweep algorithm, is imple- 
mented in CUDA and is thus optimized for massively parallel computing by GPGPU, which, in 
turn, is embedded in a C++ application. All experiments, and thus all timing measurements, 
were conducted using an NVIDIA Titan X GPU and an Intel XEON CPU E5-2650 running
at 2.20 GHz. Even though the CPU is designed for a server architecture, only a small part
of FaSS-MVS is run on the CPU, and thus its superiority over commodity desktop hardware 
is insignificant. (Ruf et al. 2021b, Sec. 3.) 


4.4.1 Evaluation Datasets 


FaSS-MVS is quantitatively evaluated on two public datasets, which also provide an appropriate
ground truth, namely the DTU Robot MVS dataset (Jensen et al. 2014, Aanæs et al. 2016)
and the dataset from the 3DOMcity Benchmark (Oezdemir et al. 2019). In order to provide
appropriate data for a quantitative evaluation, these two datasets rely on images of scale-modeled
buildings and an urban scenery from which an accurate ground truth is acquired.
For a qualitative evaluation and a discussion regarding the usability of FaSS-MVS for online
dense image matching and 3D reconstruction, two privately captured datasets of real-world
scenes are used, henceforth referred to as the TMB and the FB dataset. In the following, the
characteristics of these datasets are briefly introduced, and information on which portions
of the datasets are used and what kind of ground truth is available for the evaluation is
discussed. (Ruf et al. 2021b, Sec. 3.1.)

Figure 4.5: Overview of the four datasets used for performance evaluation of FaSS-MVS. Column 1: Two building
models from the DTU Robot MVS dataset. Column 2: Example images in oblique and nadir view from
the 3DOMcity Benchmark dataset. Column 3: Excerpt of the privately acquired TMB dataset. Column 4:
Use-case-specific dataset acquired during an exercise of the local fire brigade. (Ruf et al. 2021b,
Fig. 5.)


DTU Robot MVS Dataset: 


The DTU Robot MVS dataset (Figure 4.5, Column 1) is comprised of 124 different tabletop 
scenes. For each scene, input images are provided, which were captured under eight different 
lighting conditions from 49 locations. These locations are the same for all scenes, since the 
camera is mounted on an industrial robot arm, which could repeatedly be moved to the set 
camera poses. Furthermore, for each scene, a ground truth is provided in the form of a detailed 
point cloud captured by a structured light scanner. (Ruf et al. 2021b, Sec. 3.1.) 


For the quantitative evaluation of the performance of the presented approach, 21 scans of 
different building models were selected, since these scenes are closest to the targeted use- 
case, namely the online reconstruction of urban scenes and man-made structures. In this, 
the already undistorted images with an image resolution of 1600 x 1200 pixels, together with 
the corresponding intrinsic and extrinsic camera data, were used as input data to the ap- 
proach. The benchmark provides an evaluation routine and a corresponding score table for 
the assessment of a reconstructed point cloud with respect to the ground truth captured by 
the structured light scanner. However, the focus of this work lies in the estimation of depth 
and normal maps only, without the subsequent fusion into a three-dimensional point cloud. 
Thus, in order to evaluate the accuracy of the estimated depth and normal maps, correspond- 
ing ground truth data is extracted from the structured light scans by rendering them from 
the viewpoints of the corresponding input images, given the intrinsic and extrinsic camera 
data. (Ruf et al. 2021b, Sec. 3.1.) 


3DOMcity Benchmark Dataset:


Within the DTU benchmark, the camera is moved along a circular trajectory, resembling 
an orbital flight of a UAV, with the camera focusing on the object of interest. This kind of 
camera movement or flight trajectory is typical for the case when an object is to be fully 
reconstructed with a maximum precision (Wenzel et al. 2013a). However, depending on the 
aircraft and the surrounding constraints, such a flight is not always feasible or desired. Typically,
for the mapping of a larger area, a grid flight is performed, where the aircraft is flying linearly
over the area of interest, with the camera orientation being fixed with respect to the sensor 
carrier. (Ruf et al. 2021b, Sec. 3.1.) 


Such a configuration is simulated by the data provided as part of the 3DOMcity Benchmark 
(Oezdemir et al. 2019) (Figure 4.5, Column 2). In this, images of a scale-modeled urban scene, 
comprised of differently sized and shaped buildings, as well as streets and vegetation, are 
captured with a DSLR camera which is moved along a rigid bar in parallel lines over the 
model. The distorted and undistorted images are provided with a maximum resolution of 
6016 x 4016 pixels from oblique and nadir vantage points, with a forward and sideways
overlap of 85/75 % in case of the oblique views and 80/65 % between the nadir views. While 
the benchmark internally uses a point cloud of two exemplary building models, which is 
captured by a multi-stripe triangulation-based laser scanner in order to assess the accuracy 
of submitted DIM point clouds, it publicly only provides a reference point cloud for the task of 
classification, which is computed by a semi-global DIM algorithm. (Ruf et al. 2021b, Sec. 3.1.) 


To use the data of the 3DOMcity Benchmark for the quantitative evaluation of the perfor- 
mance of the presented approach, the already undistorted images are first downscaled to a 
size of 1798 x 1200 pixels, preserving the initial aspect ratio, before estimating the intrinsic 
camera parameters with the help of COLMAP (Schönberger and Frahm 2016). The extrinsic 
camera data is extracted from the reference that is provided as part of the benchmark. For 
the accuracy assessment of the depth maps, the reference data is computed by rendering the 
reference DIM point cloud of the whole model, provided for the task of point cloud classifi- 
cation, from the viewpoints of the input images, just as in the case of the dataset of the DTU 
benchmark. (Ruf et al. 2021b, Sec. 3.1.) 


Real-World Use-Case-Specific TMB and FB Dataset: 


The strength and aim of the datasets from the DTU and 3DOMcity benchmarks are their 
small scale and the associated ability to record or compute accurate reference data, which 
in turn facilitates a quantitative evaluation of the accuracy of the assessed algorithms. These 
datasets, however, are recorded in controlled environments and do not fully accommodate
the use-case aimed at with FaSS-MVS, namely the online MVS from monocular video data
captured by COTS UAVs, flying at altitudes below 100 m. In order to account for the named
use-case and perform a qualitative evaluation on real-world data, appropriate test data was
captured and collected into a private dataset. (Ruf et al. 2021b, Sec. 3.1.)


This dataset is two-fold. The first part, namely the TMB dataset (Figure 4.5, Column 3), consists
of four sequences captured by a DJI Phantom 3 Professional, flying around a free-standing
house and containers at altitudes between 8 m and 15 m. For the second part, which is denoted
as the fire brigade (FB) dataset (Figure 4.5, Column 4), images were acquired during a fire 
brigade exercise around a big industrial building. The data was captured using a DJI Ma- 
trice 200 with a Zenmuse XT2 sensor flying linearly over the area on which the exercise was 
performed. For all sequences, the images were captured at a frame rate of about 1 FPS and 
downsampled to an image size of 1920 x 1080 pixels. However, due to the varying velocities, 
the distances between the frames are not always the same and thus images that are not appro- 
priate as input to the presented approach, e.g. by providing too little offset, are discarded. As 
reference data, detailed point clouds of each sequence were computed by means of structure- 
from-motion (SFM) and MVS using COLMAP (Schönberger and Frahm 2016, Schönberger 
et al. 2016). Given the GNSS metadata provided by the UAVs for each input image, the refer- 
ence data was transformed into a local east-north-up (ENU) frame in order to have a metric 
reference. The undistorted images as well as the intrinsic and extrinsic camera data produced 
by COLMAP serve as input to FaSS-MVS for the evaluation with respect to the TMB and FB 
dataset. (Ruf et al. 2021b, Sec. 3.1.) 


4.4.2 Error Measures 


For a direct quantification of the error between the estimated depth map $D_{est}$ and the corresponding
ground truth $D_{gt}$, absolute and relative L1 measures are used:

$$\text{L1-abs}(D_{est}, D_{gt}) = \frac{1}{|\mathcal{V}|} \sum_{p \in \mathcal{V}} \left| D_{est}(p) - D_{gt}(p) \right|, \quad \text{and} \tag{4.13}$$

$$\text{L1-rel}(D_{est}, D_{gt}) = \frac{1}{|\mathcal{V}|} \sum_{p \in \mathcal{V}} \frac{\left| D_{est}(p) - D_{gt}(p) \right|}{D_{gt}(p)}. \tag{4.14}$$

In this, $\mathcal{V}$ denotes the set of pixels for which both $D_{est}$ and $D_{gt}$ have valid depth measurements.
While L1-abs provides an absolute and, in turn, interpretable insight on the mean
error of the estimated depth map, it is rather unsuitable for comparing the results across
multiple datasets with different depth ranges. This is because the error of depth measurements
typically increases with increasing depth, which leads to a higher absolute error for
datasets with a larger scene depth. In order to compensate for this effect, the relative L1-rel
measure normalizes the absolute difference by the depth stored at the corresponding ground
truth pixel. This reduces the effect that erroneous pixels in distant areas of the scene have
on the error score, while at the same time increasing the weight of the pixels that are close
to the camera. (Ruf et al. 2021b, Sec. 3.2.)


The two error measures introduced above provide a simple strategy to evaluate the error of
the estimates. However, they do not allow reasoning about the completeness and density of
the estimated depth map. Thus, since the focus of this work is on dense MVS, it is also of
great interest to know how many pixels of $D_{est}$ are actually filled with correct estimates. To
do so, two tightly coupled error scores, namely the accuracy ($\mathrm{Acc}_\theta$) and completeness ($\mathrm{Cpl}_\theta$),
are also used. These scores are typically used to evaluate classification tasks, but have also
been used to evaluate range measurements in recent years (Schöps et al. 2017, Knapitsch et al.
2017). On the one hand, the accuracy $\mathrm{Acc}_\theta$ indicates the fraction of pixels within the estimated
depth map $D_{est}$ for which the corresponding depth value is within a given threshold $\theta$ to the
ground truth:

$$\mathrm{Acc}_\theta(D_{est}, D_{gt}) = \frac{1}{|\mathcal{E}|} \sum_{p \in \mathcal{V}} \left[ \max\left( \frac{D_{est}(p)}{D_{gt}(p)}, \frac{D_{gt}(p)}{D_{est}(p)} \right) < \theta \right]. \tag{4.15}$$

The completeness $\mathrm{Cpl}_\theta$, on the other hand, indicates the fraction of the ground truth pixels
for which estimates exist that are within the given distance threshold to the reference:

$$\mathrm{Cpl}_\theta(D_{est}, D_{gt}) = \frac{1}{|\mathcal{G}|} \sum_{p \in \mathcal{V}} \left[ \max\left( \frac{D_{est}(p)}{D_{gt}(p)}, \frac{D_{gt}(p)}{D_{est}(p)} \right) < \theta \right]. \tag{4.16}$$


Again, $\mathcal{V}$ holds the set of pixels for which both $D_{est}$ and $D_{gt}$ have valid depth measurements.
Similarly, $\mathcal{E}$ denotes the pixel set with valid estimates, while $\mathcal{G}$ holds the pixels with valid
ground truth values. In both Equation (4.15) and Equation (4.16), the operator $[\cdot]$ refers to
the Iverson bracket. The threshold $\theta$ is given as the percentage of the corresponding ground
truth value. For example, $\mathrm{Acc}_{1.25}$ and $\mathrm{Cpl}_{1.25}$ hold the fraction of pixels with respect to $D_{est}$
and $D_{gt}$, respectively, for which the difference between the estimate and ground truth is smaller than 25 %
of the corresponding ground truth depth. These two measures can further be summarized
into a combined score, namely the $F_\theta$-score, which is the harmonic mean between $\mathrm{Acc}_\theta$
and $\mathrm{Cpl}_\theta$:

$$F_\theta(D_{est}, D_{gt}) = 2 \cdot \frac{\mathrm{Acc}_\theta \cdot \mathrm{Cpl}_\theta}{\mathrm{Acc}_\theta + \mathrm{Cpl}_\theta}. \tag{4.17}$$

Thus, a high $F_\theta$-score indicates a good trade-off between the achieved accuracy of the depth
map and its completeness with respect to the ground truth. (Ruf et al. 2021b, Sec. 3.2.)
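
The error measures of this section can be sketched in plain Python. The sketch assumes flattened depth maps in which invalid pixels are marked with `None`, and it assumes that the threshold test compares max(D_est / D_gt, D_gt / D_est) against θ. The function names and the sample values are hypothetical.

```python
def valid_sets(est, gt):
    """Index sets of valid estimates, valid ground truth, and their intersection."""
    e = {i for i, d in enumerate(est) if d is not None}
    g = {i for i, d in enumerate(gt) if d is not None}
    return e, g, e & g


def l1_abs(est, gt):
    _, _, v = valid_sets(est, gt)
    return sum(abs(est[i] - gt[i]) for i in v) / len(v)


def l1_rel(est, gt):
    _, _, v = valid_sets(est, gt)
    return sum(abs(est[i] - gt[i]) / gt[i] for i in v) / len(v)


def hits(est, gt, theta):
    """Number of jointly valid pixels whose depth ratio is below theta."""
    _, _, v = valid_sets(est, gt)
    return sum(1 for i in v if max(est[i] / gt[i], gt[i] / est[i]) < theta)


def accuracy(est, gt, theta):
    e, _, _ = valid_sets(est, gt)
    return hits(est, gt, theta) / len(e)


def completeness(est, gt, theta):
    _, g, _ = valid_sets(est, gt)
    return hits(est, gt, theta) / len(g)


def f_score(est, gt, theta):
    acc, cpl = accuracy(est, gt, theta), completeness(est, gt, theta)
    return 2.0 * acc * cpl / (acc + cpl)


est = [1.0, 2.0, None, 4.0, 9.0, 5.0]
gt = [1.0, 2.2, 3.0, None, 3.0, None]
scores = (accuracy(est, gt, 1.25), completeness(est, gt, 1.25), f_score(est, gt, 1.25))
```

In this toy example, two of the three jointly valid pixels pass the threshold, which is related to five valid estimates for the accuracy and to four valid ground truth pixels for the completeness.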


4.4.3 Need for Hierarchical Processing and Finding the Optimal
Number of Input Images 


In the first experiment, the need for and importance of the hierarchical processing scheme
within FaSS-MVS as well as the best configuration on the size of the input bundle are evaluated.
Here, a couple of aspects are considered in order to find the best configuration for
the succeeding experiments. The objective is to find the appropriate number n of Gaussian
pyramid levels and the size |Q| of the input bundle, providing a good trade-off between

- the error of the resulting depth maps, measured by L1-abs and L1-rel,

- the sampling density of the scene and the entailed resources needed for the
computation,

- as well as the resulting processing run-time.


For this experiment, a fronto-parallel plane orientation is used as part of the plane-sweep
image matching and the NCC with a support region of 5 x 5 pixels is set as similarity measure.
The optimization of the cost volume and the extraction of the optimal depth map is done by
employing the $\mathrm{SGM}^{\Pi}$ scheme, which is the adaptation of the standard SGM optimization to
the use of plane-sweep image matching (cf. Section 4.3.3.1). The smoothness penalty within
the SGM optimization is set to $\varphi_1 = 100$, while the adaptive $\varphi_2$ penalty as described in
Section 4.3.3.2 is used. This, together with the 5 x 5 sized NCC as matching cost, was chosen
in accordance with the work of Scharstein et al. (2017). To find the appropriate height of the
Gaussian pyramids, the size of the input bundle, i.e. the number of input images, is first set
to |Q| = 3. (Ruf et al. 2021b, Sec. 3.3.)


Table 4.1 lists the mean L1 errors of the estimated depth maps when evaluated with different
numbers of pyramid levels on the datasets of both the DTU and 3DOMcity benchmark. In this,
the absolute and relative L1 measures are used, averaged over all depth maps within each
dataset. It is to be expected that the omission of any hierarchical processing, i.e. the use
of only one pyramid level and thus no coarse-to-fine processing, would lead to the smallest
error between the estimate and ground truth. However, the results reveal that in case of the
DTU dataset the smallest mean error, even if it is only slightly smaller, is achieved when
setting n = 3, while the best result in case of the 3DOMcity dataset is achieved at n = 1.
(Ruf et al. 2021b, Sec. 3.3.)


Table 4.1: Mean errors achieved on the DTU and 3DOMcity dataset for different numbers of Gaussian pyramid
levels (n) as part of the hierarchical processing scheme. The error metrics used are the absolute L1-abs,
measured in mm, as well as the relative L1-rel measure. Both are averaged over all evaluated depth maps
within each dataset. (Ruf et al. 2021b, Tab. 1.)

                                n=1       n=2       n=3       n=4       n=5
DTU        L1-abs (in mm)    26.394    26.221    23.473    25.045    29.676
                            ±24.262   ±23.835   ±19.656   ±19.298   ±19.436
           L1-rel             0.036     0.036     0.032     0.034     0.041
                             ±0.032    ±0.032    ±0.026    ±0.026    ±0.026
3DOMcity   L1-abs (in mm)    12.789    14.936    21.801    32.458    47.422
                             ±6.916    ±6.754    ±8.010    ±9.408   ±22.292
           L1-rel             0.010     0.012     0.017     0.026     0.037
                             ±0.006    ±0.006    ±0.007    ±0.009    ±0.014


Table 4.2: Processing run-time measured for different configurations of the pyramid height on the DTU and
3DOMcity dataset. In addition, the maximum number of sampling planes with which the scene was
sampled at the highest pyramid level is stated. (Ruf et al. 2021b, Tab. 2.)

                                  n=1     n=2     n=3     n=4     n=5
DTU        Run-time (in ms)      2365    1315     386     220     187
                                  ±15     ±10      ±2      ±2      ±1
           max. # planes          256     256     128      64      32
3DOMcity   Run-time (in ms)       613     431     225     196     192
                                   ±3      ±3      ±1      ±1
           max. # planes          128      64      32      16       8


As described in Section 4.3.2.3, the plane distances within the plane-sweep sampling, and 
thus the sampling points, are selected in such a way that two consecutive planes induce a 
maximum disparity difference of 1 pixel. Depending on the capturing setup, i.e. the relative 
poses between the images and their obliqueness and, in turn, the range of the scene depth, this 
can lead to a very high number of sampling points and with it to a large memory consumption, 
as the dimensions of the three-dimensional cost volume need to be set accordingly. Thus, 
in order to not exceed the memory limit, the maximum number of sampling points for the 
highest pyramid level is restricted to 256 in the implementation of the approach. In case of 
the camera setup of the DTU dataset and the configuration of this experiment, i.e. having a 
bundle size of |Q| = 3, a pyramid height of 3 is the smallest height at which the number of 
sampling points at the highest level does not reach or exceed the set limit, as Table 4.2 shows. 
Comparing Table 4.1 and Table 4.2 further reveals that on both datasets the best results are
achieved when the computation is initialized at the highest pyramid level with a maximum
of 128 sampling planes. (Ruf et al. 2021b, Sec. 3.3.)
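
The relationship between the pyramid height and the number of sampling planes can be illustrated with a minimal sketch. It assumes a simplified, rectified two-view setup with fronto-parallel planes, in which the disparity of a plane at depth z is f·b/z. The function names, the parameters `focal_px`, `baseline`, `z_min` and `z_max`, and the halving of the focal length per pyramid level are illustrative assumptions and not the FaSS-MVS implementation; only the one-pixel spacing and the limit of 256 planes are taken from the text.

```python
import math

MAX_PLANES = 256  # limit on the number of sampling planes stated in the text


def num_sampling_planes(focal_px, baseline, z_min, z_max, level=0,
                        max_planes=MAX_PLANES):
    """Number of planes so that neighbouring planes differ by <= 1 px disparity.

    Assumption: rectified two-view geometry with disparity = f * b / z, and a
    focal length (in pixels) that halves with every pyramid level.
    """
    f = focal_px / (2 ** level)
    disparity_range = f * baseline * (1.0 / z_min - 1.0 / z_max)
    planes = math.ceil(disparity_range) + 1
    return min(planes, max_planes)


def plane_depths(focal_px, baseline, z_min, z_max, level=0,
                 max_planes=MAX_PLANES):
    """Depths of the sampling planes (far to near), uniform in inverse depth."""
    n = num_sampling_planes(focal_px, baseline, z_min, z_max, level, max_planes)
    if n == 1:
        return [z_max]
    inv_far, inv_near = 1.0 / z_max, 1.0 / z_min
    step = (inv_near - inv_far) / (n - 1)
    return [1.0 / (inv_far + i * step) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical camera values, chosen only for illustration.
    for level in range(4):
        print(level, num_sampling_planes(1024, 0.125, 1.0, 2.0, level))
```

Under these assumptions, every additional pyramid level roughly halves the number of planes, which resembles the halving visible in the 3DOMcity row of Table 4.2. Once the limit is reached, the spacing between neighbouring planes exceeds one pixel of disparity.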


Another criterion used to deduce the best configuration of the Gaussian pyramid height is the
run-time needed to estimate a single depth map. Table 4.2 lists the corresponding
measurements, i.e. the number of milliseconds it takes to estimate a single depth map given a
certain number of pyramid levels, as well as the number of planes used for sampling the scene
space at the highest pyramid level. The measurements again show that, in case of the DTU
dataset, the number of sampling planes at the highest pyramid level is equal to the limit of
256 for n < 3, and that the run-time decreases drastically with a smaller number of sampling
points. Furthermore, the significant drop of about one second in run-time between a pyramid
height of 2 and 3 suggests that the reduced use of processing resources on the GPU increases
the processing speed and that going from n = 2 to n = 3 significantly improves efficiency.
This is because a higher number of pyramid levels reduces not only the number of sampling
points, but also the image size at the highest pyramid level and with it the number of pixels
that need to be matched. Thus, depending on the camera setup, hierarchical processing is very
important in order to ensure a high sampling density of the scene space, while at the same
time efficiently utilizing the processing hardware and, in turn, allowing for high processing
speeds. In case of the DTU dataset, this experiment shows that the best number of pyramid
levels is n = 3, which is thus used in the subsequent experiments. In case of the 3DOMcity
dataset, Table 4.1 suggests that the best configuration is to use the original image size. A
hierarchical processing scheme is needed, however, in order to use SGM^Π-sn, the extension of
the SGM algorithm that considers local surface orientations in order to account for slanted
surfaces. Thus, in case of the 3DOMcity dataset, the subsequent experiments are executed with
n = 2, which induces only a slightly higher mean error compared to the best configuration.
(Ruf et al. 2021b, Sec. 3.3.)


Table 4.3: Mean errors achieved on the DTU and 3DOMcity dataset for different input bundle sizes (|Q|), i.e. number
of images. In addition, the differences in run-time, with respect to the measurements of the first part (i.e.
|Q| = 3), are stated. (Ruf et al. 2021b, Tab. 3.)

Dataset    Metric                |Q|=3     |Q|=5     |Q|=7
DTU        L1-abs (in mm)        23.473    19.832    21.843
                                ±19.656   ±16.225   ±21.605
           L1-rel                 0.032     0.027     0.031
                                 ±0.026    ±0.021    ±0.031
           Δ Run-time (in ms)               +271      +302
3DOMcity   L1-abs (in mm)        14.936    14.615    16.514
                                 ±6.754    ±6.254    ±7.569
           L1-rel                 0.012     0.012     0.014
                                 ±0.006    ±0.007    ±0.009
           Δ Run-time (in ms)               +360      +410


In the second part of this experiment, the effect of the number of input images and,
in turn, the optimal size |Q| of the input bundle are evaluated. Here, the settings for the
plane-sweep image matching and the subsequent SGM optimization are kept the same as before.
The height of the Gaussian pyramids is fixed to n = 3 in case of the DTU dataset and n = 2
in case of the 3DOMcity dataset. Table 4.3 lists the mean errors achieved on both
datasets with different numbers of input images, as well as the difference in run-time with
respect to the configuration selected in the first part of the experiment. The results reveal
that the best accuracies are achieved when five input images are used for image matching, even
though, in case of the 3DOMcity dataset, it is only a marginal improvement. As expected, the
utilization of more input images in the process of image matching also leads to an increase
in run-time, since more pixels are matched. At the same time, however, there is more time
available to keep up with the image acquisition, as discussed in Section 4.5.3. In conclusion,
in the subsequent experiments, the size of the input bundle is set to |Q| = 5, while the height
of the Gaussian image pyramids is set to n = 3 and n = 2 in case of the DTU and 3DOMcity
dataset, respectively. (Ruf et al. 2021b, Sec. 3.3.)


4.4.4 Effects of Different Similarity Measures in the Process of 
Dense Multi-Image Matching 


As part of the plane-sweep multi-image matching, this approach comprises two different 
similarity measures and cost functions: The Hamming distance of the census transform (CT) 
(cf. Section 2.1.1) as well as a truncated and scaled form of the normalized cross correlation 
(NCC) (cf. Section 2.1.2). While the CT is computationally less expensive than the NCC and 
is thus more suitable for real-time or online processing, it is less discriminative, which might 
result in a more ambiguous set of matched pixel correspondences. When working with a 
stereo normal case, in which the input images suffer only from little perspective distortion 
induced by homographic transformations, the CT outperforms the NCC in both run-time 
and accuracy (Ruf et al. 2021a). However, as the results in Table 4.4 show, the perspective 
distortion, resulting from the warping of images from converging cameras by means of the 
plane-induced homography within the plane-sweep algorithm, leads to a significant increase 
in error when using the CT as similarity measure. (Ruf et al. 2021b, Sec. 3.4.) 


Apart from the two different similarity measures, the effects of different support regions are
also evaluated in the scope of this experiment. For each similarity measure, the two
most commonly used configurations were tested, with a support region of 5 × 5 pixels
being a good trade-off between uniqueness and computational complexity, while, in case
of the CT, a support region of 9 × 7 pixels is the biggest size for which the bit string
still fits into a single 64-bit integer. The configuration of the plane-sweep algorithm
and the SGM optimization is set in accordance with the values from the first experiment
(cf. Section 4.4.3). In terms of the SGM penalties, φ₁ is set to 100 for both NCC 5×5 and
NCC 9×9, since the maximum matching cost of the NCC is normalized to 255, independent of
the support region. For CT 5×5 and CT 9×7, however, φ₁ is set to 9 and 24, respectively, which
is equivalent to the configuration for the NCC when considering the ratio between φ₁ and the
maximum matching cost. (Ruf et al. 2021b, Sec. 3.4.)


Table 4.4: Mean errors achieved on the DTU and 3DOMcity dataset when using different similarity measures and
cost functions with different support regions. (Ruf et al. 2021b, Tab. 4.)

Dataset    Metric             CT 5×5    CT 9×7    NCC 5×5   NCC 9×9
DTU        L1-abs (in mm)     42.136    42.305    19.832    19.667
                             ±37.958   ±36.394   ±16.225   ±16.453
           L1-rel              0.056     0.057     0.027     0.027
                              ±0.048    ±0.046    ±0.021    ±0.021
3DOMcity   L1-abs (in mm)     22.128    26.005    14.615    13.789
                             ±14.218   ±14.106    ±6.254    ±5.962
           L1-rel              0.019     0.022     0.012     0.011
                              ±0.014    ±0.014    ±0.007    ±0.006
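
The penalty scaling described above can be reproduced with a short calculation. The sketch below assumes that the maximum matching cost of the census transform equals the length of its bit string and that the centre pixel is excluded from that string; both are assumptions made for illustration, as are the function names. The values 100 and 255 are taken from the text.

```python
NCC_MAX_COST = 255  # NCC matching cost is normalized to 255 (from the text)
NCC_PENALTY = 100   # smoothness penalty used with the NCC (from the text)


def census_bits(width, height):
    """Bits in a census bit string, assuming the centre pixel is excluded."""
    return width * height - 1


def scaled_penalty(max_cost, ref_penalty=NCC_PENALTY, ref_max_cost=NCC_MAX_COST):
    """Scale the penalty so that penalty / max_cost matches the NCC ratio."""
    return round(ref_penalty / ref_max_cost * max_cost)


if __name__ == "__main__":
    for width, height in [(5, 5), (9, 7)]:
        bits = census_bits(width, height)
        print(f"CT {width}x{height}: {bits} bits, "
              f"fits into 64 bit: {bits <= 64}, "
              f"penalty: {scaled_penalty(bits)}")
```

Under these assumptions, the 5 × 5 and 9 × 7 census windows yield bit strings of 24 and 62 bits, and the scaled penalties evaluate to 9 and 24, matching the values stated above.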


Even though the NCC with a support region of 9 × 9 pixels achieves the best results, NCC 5×5
is selected for further experiments, since its error is only slightly higher, while its
computational complexity is significantly lower and, in turn, its throughput higher than that
of NCC 9×9, as illustrated in Section 3.4.1.5. (Ruf et al. 2021b, Sec. 3.4.)


4.4.5 Ability to Reconstruct Non-Fronto-Parallel Surface Structures 


As described in Section 4.3.3.1, apart from the straightforward combination of SGM with
the plane-sweep sampling (SGM^Π), FaSS-MVS comprises two further extensions of the SGM
algorithm that account for non-fronto-parallel surface structures: the incorporation of
surface normals to adjust the zero-cost transition in the SGM path aggregation (SGM^Π-sn)
and the penalization of deviations from the gradient of the minimum cost path (SGM^Π-pg).
In addition, by selecting an appropriate normal vector, and with it a corresponding sweeping
direction, it is also possible to adjust the sampling plane orientations of the plane-sweep
sampling to the scene structure. In the following, the results achieved by SGM^Π-sn and
SGM^Π-pg, in combination with a fronto-parallel sampling plane orientation, are first
evaluated and compared to those achieved by SGM^Π, before the effects of different
non-fronto-parallel plane orientations are examined. The configuration of the other
hyper-parameters is as described and evaluated above: the size of the input bundle is set to
|Q| = 5, and the NCC with a support region of 5 × 5 pixels is used as similarity measure and
cost function. The pyramid height is set to n = 3 and n = 2 for the DTU and 3DOMcity dataset,
respectively. (Ruf et al. 2021b, Sec. 3.5.)


Table 4.5: Quantitative comparison of the results achieved by FaSS-MVS with different implementations and
adaptations of the SGM algorithm in combination with a fronto-parallel sweeping direction in order to also
account for non-fronto-parallel surfaces. (Ruf et al. 2021b, Tab. 5.)

Dataset    Metric             SGM^Π     SGM^Π-sn  SGM^Π-pg
DTU        L1-abs (in mm)     19.832    19.768    19.684
                             ±16.225   ±16.192   ±16.154
           L1-rel              0.027     0.027     0.027
                              ±0.021    ±0.021    ±0.021
3DOMcity   L1-abs (in mm)     14.615    14.673    15.074
                              ±6.254    ±6.229    ±6.133
           L1-rel              0.012     0.012     0.012
                              ±0.007    ±0.007    ±0.006


The quantitative results displayed in Table 4.5 reveal only minor differences in the L1 error
between the different implementations of the SGM optimization. While, in case of the DTU
dataset, the best results are achieved by the SGM^Π-pg implementation, on the 3DOMcity
dataset, the standard adaptation of the SGM optimization to the plane-sweep sampling, i.e.
SGM^Π, reaches the lowest error. The relative L1 error does not reveal any difference, because
the individual L1-rel scores only start to differ from the fourth decimal place onwards.
Nonetheless, the ranking with respect to the L1-rel score corresponds to that of the L1-abs
score. In a qualitative comparison, Figure 4.6 reveals that SGM^Π-sn leads to a seemingly
smoother depth and normal map (e.g. on the ground plane), while at the same time, however,
losing small details and amplifying unwanted depth discontinuities in some areas, such as the
building facade. When closely comparing the normal maps of SGM^Π and SGM^Π-pg, slightly
smaller stair-casing artifacts can be noticed in case of SGM^Π-pg, which is consistent with
its slightly lower error in Table 4.5. A qualitative comparison of the results on the
3DOMcity dataset in Figure 4.7, however, does not reveal any noticeable differences between
the different implementations. The small L1-abs error achieved by SGM^Π on the 3DOMcity
dataset is assumed to be due to the fact that the 3DOMcity dataset also contains a subset of
nadir images, in which only a small number of slanted surfaces exist and the fronto-parallel
orientation of the sampling planes coincides with most of the scene structure. (Ruf et al.
2021b, Sec. 3.5.)


[Figure 4.6 panel titles: Ground Truth Depth, Ground Truth Normal (row 1); Depth, Normal, Confidence, Depth Difference (rows 2 - 4)]

Figure 4.6: Qualitative comparison of the results achieved by the three different SGM implementations of FaSS-MVS
on the DTU dataset. Row 1: Reference data from the dataset. Row 2 - 4: Data, i.e. depth, normal and
confidence maps, computed by SGM^Π, SGM^Π-sn and SGM^Π-pg, respectively, as well as difference maps
holding the pixel-wise absolute difference between the estimated depth map and the ground truth. The
color encoding reaches from dark blue (low error) via green to yellow (high error). The depth range
within the depth maps reaches from 580 mm (blue) to 830 mm (red). The estimated maps are masked
according to the ground truth. (Ruf et al. 2021b, Fig. 6.)


[Figure 4.7 panel titles: Ground Truth Depth, Ground Truth Normal, Reference Image (row 1); Depth Difference; row label: SGM^Π-sn]

Figure 4.7: Qualitative comparison of the results achieved by the three different SGM implementations of FaSS-MVS
on the 3DOMcity dataset. Row 1: Reference data from the dataset. Row 2 - 4: Data, i.e. depth, normal
and confidence maps, computed by SGM^Π, SGM^Π-sn and SGM^Π-pg, respectively, as well as difference
maps holding the pixel-wise absolute difference between the estimated depth map and the ground truth.
The depth range within the depth maps reaches from 1 m (blue) to 1.8 m (red). The color encoding of
the difference maps reaches from dark blue (low error) via green to yellow (high error). The estimated
maps are masked according to the ground truth. For visualization in this figure, the resulting images
have been rotated counterclockwise by 90°. Thus, the color encoding of the normal maps differs from
that used in the other figures. Here, red represents an upwards orientation and green represents an
orientation to the left. (Ruf et al. 2021b, Fig. 7.)


To further quantify the strengths and weaknesses of the three different SGM aggregation
strategies, three ROC curves, one for each extension, are plotted for each dataset in Figure 4.8.
The curves illustrate the error rate achieved by the corresponding SGM extension as
a function of increasing density of the estimated depth map. The density of the depth map
is varied by sampling the pixels in steps of 5% based on their ordered confidence
stored in C, going from high- to low-confidence estimates. The mean error rate is quantified
by 1 - Acc_1.05 (cf. Equation (4.15)) and states the share of sampled pixels in D whose absolute
difference to the ground truth exceeds 5 % of the ground truth value. Thus, at a low density
of D, i.e. a high confidence threshold, the error rate should ideally be at its minimum and
then increase with increasing density, reaching the overall error of D at a density of 100%.
The plots start at a density of 5%, since the error rate at a density of 0% is undefined.
The curves in Figure 4.8 further support the superiority of the surface-aware SGM extensions
over the standard SGM adaptation to plane-sweep sampling, as for both datasets the curves
of SGM^Π-sn and SGM^Π-pg are below that of SGM^Π, illustrating smaller error rates. The fact
that most of the ROC curves start off at a high error rate at a density of 5% and then drop
before increasing again suggests that the estimated confidence values do not represent
the certainty of the depth estimates appropriately. The reasons for this are manifold and
are further discussed in Section 4.5.4. (Ruf et al. 2021b, Sec. 3.5.)
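
The construction of such a curve can be sketched as follows. This is a minimal illustration that operates on flat lists of per-pixel values; the function name and its parameters are hypothetical, and the error criterion (absolute difference exceeding 5 % of the ground truth value) follows the description in the text rather than Equation (4.15), which is not reproduced here.

```python
def error_rate_vs_density(depth, confidence, ground_truth,
                          rel_tol=0.05, step_pct=5):
    """Return (density in %, error rate) pairs for a growing depth-map density.

    Pixels are added from the most to the least confident one. A pixel counts
    as an error if |depth - gt| exceeds rel_tol * gt (assumed reading of
    1 - Acc_1.05).
    """
    order = sorted(range(len(depth)), key=lambda i: confidence[i], reverse=True)
    curve = []
    for pct in range(step_pct, 101, step_pct):
        count = max(1, pct * len(order) // 100)
        selected = order[:count]
        errors = sum(
            1 for i in selected
            if abs(depth[i] - ground_truth[i]) > rel_tol * ground_truth[i]
        )
        curve.append((pct, errors / count))
    return curve


if __name__ == "__main__":
    # Toy data: the ten most confident estimates are exact, the rest 10 % off.
    gt = [100.0] * 20
    est = [100.0] * 10 + [110.0] * 10
    conf = [20 - i for i in range(20)]
    for pct, rate in error_rate_vs_density(est, conf, gt):
        print(f"{pct:3d} %  error rate {rate:.2f}")
```

With well-behaved confidence values, as in this toy input, the error rate stays low at small densities and rises monotonically towards the overall error at 100 %. A curve that starts high and drops first, as observed in Figure 4.8, indicates that some of the most confident pixels are wrong.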


Since FaSS-MVS aims for incremental and online processing, meaning that the computation
should ideally keep up with the input stream, the run-time of each SGM extension is also of
interest. Table 4.6 lists the run-time and error of the complete approach, with the
above-mentioned parameterization, for each of the three SGM implementations. The
measurements were conducted on the dataset of the DTU benchmark. In addition to the
standard use of 8 aggregation paths, which achieves the lowest error, the run-time and error
when using only 4 aggregation paths are also listed. In the latter case, the diagonal
aggregation paths are omitted within the SGM aggregation. This is motivated by the fact that a
number of studies (Banz et al. 2010, Hernandez-Juarez et al. 2016) show that a reduction in
the number of aggregation paths from 8 to 4 greatly decreases the computation time of the
SGM aggregation (cf. Section 3.4.1.5), while only marginally increasing its error, which is
also supported by the numbers in Table 4.6. Furthermore, the measurements reveal that
especially the SGM^Π-pg extension introduces a high computational complexity compared to
SGM^Π and SGM^Π-sn. However, the reduction of the aggregation paths has a great impact on
the run-time, reducing it by up to 45 %, while only marginally affecting the error. Whether the
listed run-time is sufficient for online processing is further discussed in Section 4.5.3. (Ruf
et al. 2021b, Sec. 3.5.)


Table 4.6: Run-time and error measurements of the three SGM extensions of FaSS-MVS with 8 and 4 aggregation
paths each, conducted on the DTU dataset with an image size of 1600 x 1200 pixels. (Ruf et al. 2021b,
Tab. 6.)

Configuration   Metric              SGM^Π     SGM^Π-sn  SGM^Π-pg
8-Path SGM      L1-abs (in mm)      19.832    19.768    19.684
                                   ±16.225   ±16.192   ±16.154
                L1-rel               0.027     0.027     0.027
                                    ±0.021    ±0.021    ±0.021
                Run-time (in ms)     640       895       2079
                                     ±3        ±2        ±3
4-Path SGM      L1-abs (in mm)      21.072    21.091    20.908
                                   ±16.634   ±16.632   ±16.599
                L1-rel               0.029     0.029     0.029
                                    ±0.022    ±0.022    ±0.022
                Run-time (in ms)     413       546       1132
                                     ±2        ±2        ±3
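
The stated reduction in run-time can be checked directly against the numbers of Table 4.6. The following sketch only restates the tabulated values; the dictionary keys are ASCII stand-ins for the three SGM variants and the helper name is chosen for illustration.

```python
# Mean run-time (ms) and L1-abs (mm) on DTU for (8 paths, 4 paths), Table 4.6.
RUNTIME_MS = {
    "SGM-Pi": (640, 413),
    "SGM-Pi-sn": (895, 546),
    "SGM-Pi-pg": (2079, 1132),
}
L1_ABS_MM = {
    "SGM-Pi": (19.832, 21.072),
    "SGM-Pi-sn": (19.768, 21.091),
    "SGM-Pi-pg": (19.684, 20.908),
}


def relative_change(before, after):
    """Relative change in percent when going from `before` to `after`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for name in RUNTIME_MS:
        runtime = relative_change(*RUNTIME_MS[name])
        error = relative_change(*L1_ABS_MM[name])
        print(f"{name:10s} run-time {runtime:+6.1f} %   L1-abs {error:+5.1f} %")
```

The largest run-time reduction, roughly 45 %, is obtained for the path-gradient variant, while the L1-abs error of all three variants increases by less than 7 %.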


[Figure 4.9 panel titles: Reference Image, Fronto-Parallel Sampling, Horizontal Sampling]

Figure 4.9: Qualitative comparison between the use of a fronto-parallel and non-fronto-parallel sampling
direction in combination with SGM^Π of FaSS-MVS. Columns 2 & 4: Corresponding estimated depth map.
Columns 3 & 5: Difference map holding the pixel-wise absolute difference between the estimated depth
map and the ground truth. The color encoding reaches from dark blue (low error) via green to yellow
(high error). The estimated depth maps and the difference maps are masked according to the ground
truth. (Ruf et al. 2021b, Fig. 9.)
In the second part of this experiment, the use of non-fronto-parallel plane orientations within
the plane-sweep sampling is investigated. Here, an additional horizontal and vertical
orientation, both with respect to the reference coordinate system of the scene, were selected
and compared to the fronto-parallel sampling direction. For this, the DTU dataset is split into
two subsets, one for the horizontal sampling, in which the camera is looking in a more downward
direction, and one for the vertical sampling, where the camera pitch is smaller. As reference,
results with a fronto-parallel sampling were computed separately for both subsets.
The quantitative results reveal a major increase in error when non-fronto-parallel plane
orientations are used for sampling, as also illustrated by the excerpt in Figure 4.9. However,
Figure 4.9 also reveals that in areas where the surface structure coincides with the sampling
direction, e.g. the ground plane, the depth map is very smooth and consistent. (Ruf et al.
2021b, Sec. 3.5.)


4.4.6 Improvements Gained by Post-Filtering and Geometric 
Consistency 


In a final ablative experiment, the effects of the implemented post-filtering methods to re- 
move remaining outliers and supposedly wrong estimates by means of difference-of-Gaussian 
(DOG) filtering (cf. Section 2.3.2.2) and enforcing geometric consistency (cf. Section 2.3.3.2) 
are studied. While the latter one relies on the actual estimates, the DOG filter is based on 
the assumption that image regions with low texture might lead to ambiguities in the image 
matching and, in turn, wrong estimates, masking out corresponding regions. This, however, 
might result in the false removal of good or even correct estimates. (Ruf et al. 2021b, Sec. 3.6.) 


Instead of using the absolute and relative L1 metric to quantitatively assess the results
achieved when employing post-filtering, the effects are evaluated using the accuracy Acc_θ
(cf. Equation (4.15)) and completeness Cpl_θ (cf. Equation (4.16)) measures. This is because
they indirectly include information on the density of the resulting depth maps, which should
ideally be as high as possible. Since the individual sequences of the 3DOMcity dataset
consist of too few images to perform a geometric consistency check with the
parameterization mentioned in Section 2.3.3.2, this experiment is only conducted on the
dataset of the DTU benchmark. Figure 4.10 depicts the results of different post-filtering
strategies, i.e. DOG filtering, geometric consistency filtering, as well as a combination of
both, executed in combination with the three different SGM extensions and a fronto-parallel
sampling. In addition, the accuracy-completeness curve resulting from the corresponding
configurations without post-filtering is also displayed as reference. In the construction of
the curves, the threshold θ (i.e. the threshold used to calculate the accuracy and
completeness rates) is varied within the list of 1.25, 1.20, 1.15, 1.10, 1.05 and 1.01. Note
that with a decreasing threshold, the accuracy and completeness rates drop. Thus, the highest
values are achieved with θ = 1.25. (Ruf et al. 2021b, Sec. 3.6.)
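
How the threshold sweep produces one accuracy-completeness curve can be sketched as follows. Equations (4.15) to (4.17) are not reproduced in this section, so the definitions below are assumptions made for illustration: an estimate counts as correct if its absolute difference to the ground truth is at most (θ - 1) times the ground truth value, accuracy relates the correct estimates to all remaining estimates, completeness relates them to all ground truth pixels, and the F-score is taken as the harmonic mean of both. Filtered-out pixels are represented by `None`.

```python
THETAS = [1.25, 1.20, 1.15, 1.10, 1.05, 1.01]  # thresholds listed in the text


def accuracy_completeness(depth, ground_truth, theta):
    """Accuracy and completeness of a depth map under the assumed definitions."""
    valid_gt = valid_est = correct = 0
    for d, g in zip(depth, ground_truth):
        if g is None:
            continue  # pixel without reference data
        valid_gt += 1
        if d is None:
            continue  # estimate removed by post-filtering
        valid_est += 1
        if abs(d - g) <= (theta - 1.0) * g:
            correct += 1
    acc = correct / valid_est if valid_est else 0.0
    cpl = correct / valid_gt if valid_gt else 0.0
    return acc, cpl


def f_score(acc, cpl):
    """Harmonic mean of accuracy and completeness (assumed form of the F-score)."""
    return 2.0 * acc * cpl / (acc + cpl) if acc + cpl > 0.0 else 0.0


if __name__ == "__main__":
    gt = [100.0, 100.0, 100.0, 100.0]
    est = [100.0, 103.0, 120.0, None]  # last estimate was filtered out
    for theta in THETAS:
        acc, cpl = accuracy_completeness(est, gt, theta)
        print(f"theta={theta:.2f}  Acc={acc:.2f}  Cpl={cpl:.2f}  "
              f"F={f_score(acc, cpl):.2f}")
```

In this toy example, removing an estimate lowers the completeness without affecting the accuracy, and both rates drop as θ decreases, which is the behaviour described for the curves in Figure 4.10.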
Figure 4.10: Accuracy-completeness curves of different post-filtering strategies within FaSS-MVS, i.e. DOG filtering, geometric consistency filtering, as well as
a combination of both, executed in combination with the three different SGM extensions and a fronto-parallel sampling. In the construction, the
threshold θ is varied within the list of 1.25, 1.20, 1.15, 1.10, 1.05 and 1.01. With a decreasing threshold, the accuracy and completeness rates drop. (Ruf
et al. 2021b, Fig. 10.)
Table 4.7: F-scores achieved by the three SGM extensions of FaSS-MVS in combination with the post-filtering based
on geometric consistency. (Ruf et al. 2021b, Tab. 7.)

Approach    F_1.25   F_1.20   F_1.15   F_1.10   F_1.05   F_1.01
            (in %)   (in %)   (in %)   (in %)   (in %)   (in %)
SGM^Π       74.2     74.1     74.0     73.7     71.5     51.9
SGM^Π-sn    74.1     74.1     74.0     73.6     71.5     52.3
SGM^Π-pg    75.6     75.5     75.4     75.1     72.9     51.4


Most evidently, Figure 4.10 again reveals that there is not much difference in the overall
error between the three SGM implementations. However, the accuracy-completeness curves
clearly show the differences that the individual post-filtering strategies make. Unsurprisingly,
the reference configuration without any filtering reaches the highest completeness,
as no estimates are removed from the predicted depth map, which, in turn, also leads to the
lowest accuracy. The use of DOG filtering clearly improves the accuracy, as it is likely to
remove quite a number of wrong estimates originating from low-textured areas. However, as
expected, the DOG filter presumably also removes a number of correct estimates, as the
filtering based on geometric consistency achieves similar completeness while at the same
time reaching a higher accuracy. Especially when considering the scores for θ = 1.01, i.e.
the lower left end of each curve, the geometric consistency filter achieves an increase
in completeness of approximately 5%, while exceeding the accuracy of the DOG
filter by more than 10%. A distinct recommendation on which filter to use, however, cannot
be given, since both filtering strategies have their strengths and weaknesses, especially
with respect to online processing, as discussed in Section 4.5.3. A combination of both filters
is not advisable: even though the accuracy slightly increases, the completeness drops, in part
by over 20%. Moreover, this effect can also be achieved by decreasing the threshold of the
reprojection error in the geometric consistency check, which will probably increase the
accuracy even more. (Ruf et al. 2021b, Sec. 3.6.)


Lastly, to directly compare the different SGM extensions in combination with the
geometry-based filtering, which achieves the best results, the corresponding F-scores for each
evaluated θ (cf. Equation (4.17)) are listed in Table 4.7. Just like the results displayed in
Table 4.5, the F-scores reveal the superiority of SGM^Π-pg over the other two implementations,
since for all θ except one, SGM^Π-pg reaches the highest F-score. (Ruf et al. 2021b, Sec. 3.6.)


4.4.7 Summarized Results and Comparison to Offline Multi-View 
Stereo Approaches 


Finally, before presenting and qualitatively assessing the results of the best configuration on
real-world data, a short quantitative summary and comparison to the results produced by
offline MVS is given in the following. As offline MVS approach, the widely used and
open-source COLMAP toolbox (Schönberger et al. 2016) is used. While COLMAP provides the full
reconstruction pipeline, i.e. including a subsequent fusion of depth maps into a 3D model,
only the geometric depth maps were used in order to make a fair comparison, since the fusion
into a 3D model leads to a further filtering of outliers. Nonetheless, the significance of a
comparison between an online MVS approach, like FaSS-MVS, and an offline approach can be
questioned, since the two types of approaches make different assumptions and focus on
different aspects within the processing, as further discussed in Section 4.5.1. (Ruf et al. 2021b,
Sec. 3.7.)


Based on the previous experiments, the following setup is selected for the final experiments
on the DTU dataset. The size of the input bundle is set to |Q| = 5 and the height of the image
pyramids for the hierarchical processing is set to n = 3. As sweeping direction and plane
normal for the multi-image plane-sweep sampling, a fronto-parallel plane orientation, i.e.
n = (0 0 -1)^T, is used. For the image matching, the normalized cross correlation with a support
region of 5 × 5 pixels is selected and the penalty φ₁ for the subsequent SGM optimization is set
to φ₁ = 100, with φ₂ being adaptively adjusted according to the intensity difference between
neighboring pixels as described in Section 4.3.3.2. For post-filtering of the resulting maps,
the geometric consistency check based on the reprojection error achieves the best results and
is thus selected. Since the geometric consistency check is not applicable to the dataset of
the 3DOMcity benchmark, as the individual image sequences of the dataset are too short, the
final comparison is only performed on the dataset provided by the DTU benchmark. (Ruf
et al. 2021b, Sec. 3.7.)


Table 4.8 lists the quantitative results of the summarized experiments, together with the re- 
sults achieved by the geometric depth maps of COLMAP. As to be expected, the results of the 
offline MVS toolbox COLMAP have the lowest error with respect to all accuracy measures, 
since it uses a more global optimization scheme without any run-time constraints and, in 
turn, taking significantly longer to estimate a single depth map as the numbers in Table 4.9 
show. Moreover, in contrast to an online approach, offline processing allows to take all input 
images of the sequence into account and to choose the most appropriate ones for the tasks 


Table 4.8: Final quantitative results of the three SGM extensions of FaSS-MVS on the DTU dataset, in combination 
with fronto-parallel plane-sweep sampling and post-filtering based on geometric consistency check. As 
reference, the quantitative results achieved by the geometric depth maps of the offline MVS toolbox 
COLMAP (Schönberger et al. 2016) are given. (Ruf et al. 2021b, Tab. 8.) 


Approach L1-abs Li-rel Fi25 Fı2o Frias) Fino Fios Fio 
(in mm) (in %) (in %) (in %) (in %) (in %) (in %) 
SGMH 8.549 +7.509 0.012+0.011 74.2 741 740 73.7 71.5 51.9 


SGM!T 8.479 +7.559 0.012 +0.011 741 741 740 73.6 715 52.3 
SGM! Ps 8.722 +7.255 0.013+0.010 75.6 755 754 751 72.9 51.4 
COLMAP 3.745 +5.498 0.006+0.004 80.2 80.2 80.1 80.0 796 74.4 
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Table 4.9: Run-time comparison between the three SGM extensions with 8 aggregation paths and MVS approach of 
COLMAP (Schönberger et al. 2016). Measurements were conducted on the dataset ofthe DTU benchmark 
and represent the mean run-time required by the different approaches to estimate a single depth map. 
While COLMAP allows to only estimate photometric depth maps, the quantitative evaluation with respect 
to the accuracy of COLMAP was done on the geometric depth maps (last column). (Ruf et al. 2021b, Tab. 9.) 


SGM! SGM!Ts  sGM!'rs COLMAP COLMAP 

8-Path 8-Path 8-Path Photometric Photometric+Geometric 

Run-time 640 895 2079 14687 34840 
(in ms) +3 +2 +3 +1020 +1243 


of DIM and MVS. Nonetheless, the results of FaSS-MVS for online MVS are not too far off: its
mean L1 error is roughly 2.3 times that of COLMAP and thus well below one order of magnitude
higher. Interestingly, while SGM^Π-pg outperforms the other two SGM extensions with respect to
the F-score, SGM^Π-sn has the lowest L1 error. This can be explained by the density of the
resulting depth maps. When using SGM^Π-pg, more estimates pass the geometric consistency check,
resulting in depth maps that are slightly denser than those produced by SGM^Π and SGM^Π-sn.
This increases the F-score, while at the same time also increasing the L1 error. Quantitatively
speaking, however, the difference is only marginal, and whether one SGM extension is to be
preferred over the others depends on the use-case and is to be decided based on qualitative
comparisons. (Ruf et al. 2021b, Sec. 3.7.)
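
The interplay between density, F-score and L1 error described above can be illustrated with a
small sketch. The metric definitions used here (a ratio threshold theta, accuracy as the share
of correct estimates, completeness as the share of recovered reference pixels) are assumptions
made for illustration and need not match the exact definitions used in the evaluation; the
function name and the toy values are hypothetical.

```python
def depth_metrics(estimate, reference, theta=1.25):
    """Toy depth-map metrics; None in `estimate` marks a filtered-out pixel.

    Assumption: `reference` holds a valid depth for every pixel.
    """
    pairs = [(e, r) for e, r in zip(estimate, reference) if e is not None]

    # Mean absolute and mean relative L1 error over all remaining estimates.
    l1_abs = sum(abs(e - r) for e, r in pairs) / len(pairs)
    l1_rel = sum(abs(e - r) / r for e, r in pairs) / len(pairs)

    # An estimate counts as correct if its depth ratio stays below theta.
    correct = sum(max(e / r, r / e) < theta for e, r in pairs)
    acc = correct / len(pairs)       # share of estimates that are correct
    cpl = correct / len(reference)   # share of reference pixels recovered
    f_score = 2 * acc * cpl / (acc + cpl) if correct else 0.0
    return l1_abs, l1_rel, f_score


reference = [100.0, 100.0, 100.0, 100.0]
sparse = [101.0, 99.0, None, None]    # strict filtering, few estimates
dense = [101.0, 99.0, 110.0, None]    # one additional, less precise estimate

l1_sparse, _, f_sparse = depth_metrics(sparse, reference)
l1_dense, _, f_dense = depth_metrics(dense, reference)
print(l1_sparse, f_sparse)
print(l1_dense, f_dense)
```

In this toy case the denser map achieves the higher F-score, because more reference pixels are
covered, while its mean L1 error is also higher, because the additional estimate is less precise.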


In a final comparison, Table 4.10 lists the quantitative results of some representative ap- 
proaches from literature, including COLMAP, on the evaluation set of the DTU dataset, calcu- 
lated with the actual evaluation protocol of the benchmark. Here, in contrast to the accuracy 
(Acc) and completeness (Cpl) used in this work, the Acc and Cpl are calculated based on the 
absolute difference between the estimates and reference. And instead of using the individual 
depth maps, the evaluation protocol calculates the error based on the fused point clouds. Thus, 
the values are not directly comparable to the ones in Table 4.8. However, in both tables, the 
results of COLMAP are listed, allowing for tentative comparison. (Ruf et al. 2021b, Sec. 3.7.) 


Table 4.10: Quantitative results of related work from literature on the DTU benchmark, using the actual evaluation 
protocol of the benchmark. This protocol calculates the errors not on the individual depth maps, but on 
the actual point cloud and specifies the accuracy (Acc) and completeness (Cpl) in absolute differences 
between the estimated points and the reference (lower is better). It is thus to some extent comparable to the
L1-abs error. (Ruf et al. 2021b, Tab. 10.)


Approach                            Acc        Cpl        Overall
                                    (in mm)    (in mm)    (in mm)
COLMAP (Schönberger et al. 2016)    0.400      0.664      0.532
Furu (Furukawa and Ponce 2010)      0.613      0.941      0.777
Gipuma (Galliani et al. 2015)       0.283      0.873      0.578
MVSNet (Yao et al. 2018)            0.396      0.527      0.462


4.4.8 Use-Case-Specific Experiments Conducted on Real-World
Datasets


Finally, to demonstrate the performance of FaSS-MVS on use-case-specific, real-world
datasets, experiments using SGM^Π-pg with 4 aggregation paths and the configuration
mentioned in Section 4.4.7 are conducted on the TMB and FB datasets (cf. Section 4.4.1). An
excerpt of the computed depth, normal and confidence maps is depicted in Figure 4.11, together
with corresponding depth maps estimated by COLMAP as reference. Again, the experiments and
timing measurements were conducted on an NVIDIA Titan X. The mean processing time in
case of the TMB dataset is 690 ms, varying between 320 ms and 1218 ms depending on the
arrangement of the input data and, in turn, the number of plane distances δ at which the
scene is sampled. For the FB dataset, the mean processing time is 800 ms, varying between
514 ms and 1419 ms, again depending on the arrangement of the input images. (Ruf et al.
2021b, Sec. 3.8.)


4.5 Discussion


In the following, the findings of the conducted experiments are discussed with respect 
to different aspects, namely the overall accuracy and the comparison to offline MVS ap- 
proaches (Section 4.5.1), the ability of FaSS-MVS to reconstruct slanted surface structures 
(Section 4.5.2), the run-time and the support for online processing (Section 4.5.3), as well 
as effects of the employed post-filtering algorithms and the relevance of the confidence 
estimates (Section 4.5.4). (Ruf et al. 2021b, Sec. 4.) 


4.5.1 Overall Accuracy 


As the results in Table 4.8 show, the overall accuracy of the depth maps estimated by FaSS-MVS
is lower than that of the geometric depth maps from COLMAP, and presumably also lower than
that of other offline MVS approaches, as the numbers in Table 4.10 suggest. This is
unsurprising, since the procedure and the assumptions involved differ greatly between online
and incremental approaches, such as the one presented in this work, and offline approaches.
Offline approaches, such as COLMAP, assume that all input images are available at the time of
reconstruction, which allows optimizing the set of input images that are considered for the
reconstruction of a certain viewpoint. In contrast, online approaches that incrementally
perform MVS only consider input images within a temporally confined window, at most all
images that were captured up to a certain point in time. Furthermore, offline approaches
typically do not have any time constraints. Nonetheless, the quantitative differences between
the results achieved with FaSS-MVS and COLMAP are moderate, with the mean L1 error of FaSS-MVS
being less than an order of magnitude higher, especially when using geometric consistency
filtering in post-processing. Also, a qualitative comparison on use-case-specific input data
shows the results of the presented approach to be very satisfactory. Compared to the geometric
depth maps of COLMAP, the depth maps of SGM^Π-pg lack fine-grained details, such as the roof
structures in Rows 5 & 6 of Figure 4.11, which is caused by the coarse-to-fine processing.
Bigger structures, however, are well represented and the quality of their reconstruction is
comparable, as is the overall density. Yet, even though the fronto-parallel bias of SGM is
reduced, some artifacts of the fronto-parallel sampling are still visible, especially in the
normal maps of Figure 4.11. (Ruf et al. 2021b, Sec. 4.1.)


Unfortunately, it is hard to make a quantitative comparison against other online approaches,
such as the ones presented by Gallup et al. (2007) or Pollefeys et al. (2008), with similar
assumptions, aims and use-cases. This is mainly due to the lack of appropriate datasets that
are publicly available and provide accurate ground truth, which in turn would allow for a
thorough investigation and benchmarking. This holds especially for the task of online and
incremental MVS. Current publicly available datasets and benchmarks used to evaluate the
performance of algorithms for the task of DIM and MVS aim at benchmarking offline approaches,
providing a fixed set of input images acquired from predefined viewpoints. In case of online
approaches, however, the selection and sampling of the images from an input sequence that are
used as input to the approach have a great effect on the final result. Depending on the
configuration of the input images, the range of the scene depth to be sampled can be very
large, requiring Gaussian pyramids with more levels in order to keep the required resources
at a reasonable level and not exceed the run-time constraint. However, a higher number of
pyramid levels also reduces the initial image size, which in turn reduces the level-of-detail
and the accuracy of the resulting depth map. Thus, a careful assembly of the input bundle is
just as important as the correct height of the Gaussian pyramids. (Ruf et al. 2021b, Sec. 4.1.)
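
To make the trade-off between pyramid height and level-of-detail concrete, the following
minimal sketch lists the image size on each pyramid level. It assumes that every level halves
both image dimensions, which is the common construction of a Gaussian pyramid but is not
stated explicitly above; the function name is hypothetical.

```python
def pyramid_sizes(width, height, levels):
    """Image size per pyramid level, assuming each level halves both dimensions."""
    sizes = [(width, height)]
    for _ in range(levels):
        # Round up so that odd dimensions do not lose the last pixel column/row.
        width, height = (width + 1) // 2, (height + 1) // 2
        sizes.append((width, height))
    return sizes


# Full-HD input, as used for the TMB dataset.
for level, (w, h) in enumerate(pyramid_sizes(1920, 1080, 5)):
    print(f"level {level}: {w} x {h} pixels")
```

Under this assumption, each additional level quarters the number of pixels on which the
hierarchical processing starts, which is why a higher pyramid saves run-time but removes fine
image structures from the initial estimate.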


4.5.2 Ability to Account for Non-Fronto-Parallel Surfaces 


To further increase the accuracy in the reconstruction of slanted, non-fronto-parallel
surface structures, this work proposes, apart from SGM^Π, two extensions to the SGM algorithm
that should reduce the fronto-parallel bias, namely the incorporation of surface normals to
adjust the zero-cost transition in the SGM path aggregation (SGM^Π-sn) and the penalization
of deviations from the gradient of the minimum cost path (SGM^Π-pg). The conducted
experiments reveal that these extensions only provide a slight quantitative improvement over
the standard SGM adaptation to the plane-sweep sampling (SGM^Π). This insight, however,
stands in contrast to the experiments conducted by Scharstein et al. (2017). There are at least
two reasons that could explain this discrepancy. First, Scharstein et al. (2017) demonstrate
their implementation on a two-view stereo dataset, in which the input images are captured
by two cameras mounted on a fixed rig and oriented in the same direction. Before being
processed, the images are also rectified, i.e. transformed so that both lie on the same image
plane and the epipolar lines coincide with the image rows. Thus, in the process of dense image
matching, the images are equidistantly sampled with a step-size of 1 pixel. In case of
FaSS-MVS, however, the distances of the sampling planes and, in turn, the sampling points are
chosen in such a way that the disparity shift along the epipolar line between two consecutive
planes is less than or equal to 1 pixel. This leads to a sampling with a much higher density,
already reducing the stair-casing effect in case of SGM^Π. Second, Scharstein et al. (2017)
propose to use a ground truth normal map for the adjustment of the zero-cost transition,
whereas in FaSS-MVS the upscaled normal map of the previous iteration of the hierarchical
processing is used. This is bootstrapped with SGM^Π on the highest pyramid level, introducing
inaccuracies which probably cannot be fully compensated. The qualitative analysis, however,
reveals that SGM^Π-sn and SGM^Π-pg clearly lead to smoother normal maps and that the
stair-casing artifacts in the depth maps are reduced, which is also why only SGM^Π-pg is
considered in the use-case-specific experiments. (Ruf et al. 2021b, Sec. 4.2.)
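
The sampling rule mentioned above, i.e. a disparity shift of at most 1 pixel between two
consecutive planes, can be sketched for the simplest possible configuration. The sketch
assumes fronto-parallel planes and a purely lateral baseline between two views, so that the
disparity of a plane at depth z is f·b/z; FaSS-MVS itself operates on general camera poses, so
this is an illustration of the principle rather than its actual plane selection. All names
and numbers are hypothetical.

```python
def plane_depths(focal_px, baseline_m, z_min, z_max, max_step_px=1.0):
    """Depths of fronto-parallel sampling planes between z_min and z_max.

    Consecutive planes differ by at most `max_step_px` in disparity, assuming a
    purely lateral baseline (disparity = focal_px * baseline_m / depth).
    """
    d_near = focal_px * baseline_m / z_min   # largest disparity (closest plane)
    d_far = focal_px * baseline_m / z_max    # smallest disparity (farthest plane)
    depths = []
    d = d_near
    while d > d_far:
        depths.append(focal_px * baseline_m / d)
        d -= max_step_px
    depths.append(z_max)                     # always close the range at z_max
    return depths


depths = plane_depths(focal_px=1000.0, baseline_m=1.0, z_min=10.0, z_max=20.0)
print(len(depths), "planes")
print("first depth step:", depths[1] - depths[0], "m")
print("last depth step: ", depths[-1] - depths[-2], "m")
```

The planes are equidistant in disparity but not in depth: the spacing grows with the distance
to the camera, and a larger depth range or baseline directly increases the number of planes
that need to be evaluated.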


Apart from reducing the fronto-parallel bias in the SGM path aggregation, the plane-sweep
algorithm within FaSS-MVS allows adjusting the image matching to the surface structures
in the scene by selecting appropriate normal vectors and sweeping directions. In a short
qualitative experiment (cf. Figure 4.9), the effects of a horizontal plane sampling, compared
to a fronto-parallel sampling, both in combination with SGM^Π, are studied. The results reveal
that the horizontal sampling leads to more consistent depth estimates with little or no
stair-casing artifacts in areas where the surface structure coincides with the plane
orientation, e.g. the ground plane. In areas where the surface structures are not horizontal,
however, the non-fronto-parallel sampling leads to considerable errors. To overcome this
effect, a splitting of the scene into local regions can be considered, which are sampled
individually with different plane orientations, similar to the local-plane-sweep approach
presented by Sinha et al. (2014). This, however, again comes at the cost of a higher
computational complexity. Another remedy is to repeat the plane-sweep image matching multiple
times on the whole image domain prior to the SGM optimization, with different sweeping
directions, and to perform a pixel-wise pre-selection of the best plane orientation based on
the matching costs, similar to the approach of Pollefeys et al. (2008). This leads to a smaller
increase in computational complexity compared to the first option. (Ruf et al. 2021b, Sec. 4.2.)
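
The second remedy, a pixel-wise pre-selection of the plane orientation, can be sketched as
follows. The sketch assumes that one cost volume per sweeping direction is available and that
the orientation with the lowest matching cost over all of its planes wins; this
winner-takes-all rule is an assumption for illustration and not necessarily the selection
criterion used by Pollefeys et al. (2008). The function name and cost values are hypothetical.

```python
def preselect_orientation(cost_volumes):
    """Pick, per pixel, the plane orientation with the lowest matching cost.

    cost_volumes[o][p][k] is the matching cost of plane k for pixel p when
    sweeping with orientation o. Returns one orientation index per pixel.
    """
    n_pixels = len(cost_volumes[0])
    best = []
    for p in range(n_pixels):
        # Lowest cost over all planes, evaluated separately per orientation.
        min_costs = [min(volume[p]) for volume in cost_volumes]
        best.append(min_costs.index(min(min_costs)))
    return best


# Two orientations (0: fronto-parallel, 1: horizontal), three pixels, three planes.
fronto = [[0.9, 0.2, 0.8], [0.7, 0.6, 0.9], [0.1, 0.5, 0.9]]
horizontal = [[0.5, 0.6, 0.7], [0.3, 0.8, 0.9], [0.4, 0.4, 0.6]]
print(preselect_orientation([fronto, horizontal]))
```

Since the selection only requires a minimum search over cost volumes that are computed anyway,
the additional effort is dominated by the repeated image matching, not by the selection itself.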


4.5.3 Run-Time and Online Processing 


Given the run-time measurements in Table 4.6 and Table 4.9, the presented approach is not
capable of real-time and low-latency processing in the sense that a depth map is computed for
each input frame at a rate similar to the frame rate of the input stream. Considering the
nature of the approach and the expected input data, however, the run-time is generally
sufficient for online processing, as explained in the following. The presented approach takes
a bundle of three or more input images, with a bundle size of 5 images actually yielding
better results, and performs MVS for one reference image of the input bundle, which is
typically the middle one. While these input images could be provided by individual cameras,
it is assumed that the images are extracted from an input sequence, which is captured by only
one camera that is moving around a static scene. In addition, not every image of the input
sequence can be used, since an appropriate baseline is needed between the input images in
order to allow for the estimation of scene depth. This baseline depends on the depth range
that is to be sampled and on the scene structure. In case of the TMB dataset, the mean
distance between the individual input images is 1.8 m and 1.03 m for a flight altitude of
15 m and 8 m, respectively. This distance increases with higher flight altitudes, due to a
larger scene depth. Modern COTS rotor-based UAVs can fly at speeds above 10 m/s. The typical
flight speed when capturing image data, however, is rather 1-3 m/s (DJI 2020a, DJI 2020b).
Thus, if the sets of input images are disjoint, an estimation only needs to be performed
every 3 s, considering a low flight altitude together with a high flight speed of
approximately 3 m/s and an input bundle size of 3 images. If a maximum overlap between the
input bundles is desired, meaning that the estimation of a new depth map is triggered with
every new appropriate input frame and reuses 4 images from the previous bundle, the time
available for each estimation is significantly shorter. As the use-case-specific experiments
on the TMB and FB datasets show, however, the average processing rate of SGM^Π-pg, which is
the computationally most complex variant, is between 1 and 2 Hz, depending on the arrangement
of the input images. Another possibility to reduce the run-time is the use of Gaussian
pyramids with more levels, which again comes at the cost of a reduced level-of-detail, as
already pointed out in the discussion on the overall accuracy (cf. Section 4.5.1). In short,
there are a number of possible settings in both the acquisition of the input data, e.g.
regarding the flight speed or the size and overlap of the input bundles, and the
configuration of the presented approach, e.g. regarding the Gaussian pyramid height, depth
range or optimization strategy, that allow tuning the run-time to fit the rate of the input
images and, in turn, allow for online processing. All in all, the numbers in Table 4.9 reveal
the superiority of the presented approach in terms of run-time compared to state-of-the-art
approaches for offline multi-view stereo, with a single depth map being estimated roughly 7
to 23 times faster than a photometric depth map by COLMAP. (Ruf et al. 2021b, Sec. 4.3.)
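
The relation between flight speed, image spacing and the time available per depth map can be
sketched with a simplified model. It assumes a constant flight speed along a straight path and
equally spaced input images, which is not stated in this form above; the function names are
hypothetical, and the resulting figures are illustrative only.

```python
def time_budget_s(image_spacing_m, speed_mps, new_images_per_bundle=1):
    """Time until enough new input images for the next bundle are captured.

    Assumes constant speed and equally spaced images. With maximum overlap
    between consecutive bundles, each new image triggers one estimation.
    """
    return new_images_per_bundle * image_spacing_m / speed_mps


def keeps_up(run_time_s, budget_s):
    """True if one depth map is finished before the next bundle is complete."""
    return run_time_s <= budget_s


mean_run_time_s = 0.690   # mean processing time on the TMB dataset
spacing_m = 1.03          # mean image spacing at 8 m flight altitude

for speed in (1.0, 2.0, 3.0):
    budget = time_budget_s(spacing_m, speed)
    print(f"{speed} m/s: budget {budget:.2f} s, keeps up: {keeps_up(mean_run_time_s, budget)}")
```

Under these assumptions, the mean run-time fits the budget at the lower end of the typical
flight speeds even with maximum overlap, while higher speeds call for less overlap between the
bundles or for a faster configuration, in line with the settings discussed above.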


The emergence of high-performance SOCs with embedded GPUs, like the NVIDIA Jetson
series, allows bringing approaches like FaSS-MVS directly onto the sensor carrier, e.g. the
UAV, for on-board processing. To evaluate the feasibility of running FaSS-MVS on board
an embedded device, run-time measurements were conducted on the NVIDIA Jetson AGX,
which is equipped with an 8-core 64-bit ARMv8.2 CPU and a 512-core Volta GPU. Executed
on an excerpt of the TMB dataset with an image size of 1920 x 1080 pixels, FaSS-MVS with
SGM^Π and parameterized by the final configuration as presented in Section 4.4.7 achieves an
average run-time of 727 ms on the Jetson AGX. The average run-time achieved by the same
setup on the NVIDIA Titan X is approximately 403 ms. As already discussed, the run-time
can be reduced further by increasing the pyramid height, for example to n = 4 and n = 5,
while at the same time accepting a decrease in the quality of the results. This results in an
average run-time of 444 ms and 385 ms, respectively. These experiments show that FaSS-MVS
is capable of on-board processing by utilizing a high-performance embedded SOC like
the NVIDIA Jetson AGX. This can be of particular interest when considering a deployment
on a sensor carrier that does not suffer from such strong energy constraints as COTS UAVs.
(Ruf et al. 2021b, Sec. 4.3.)


4.5.4 Post-Filtering and the Relevance of the Estimated Confidence 
Values 


In the following, the improvements gained by the post-filtering based on the DOG filter and
geometric consistency and the effects the filtering has on the online processing, as well as
the relevance and the expressiveness of the confidence estimates are discussed. A comparison
of the L1 errors in Table 4.8 to those listed in Table 4.5 reveals that with the use of
post-filtering based on geometric consistency, the mean errors can be reduced drastically, by
approximately 40 %. The price for this improvement, however, is a loss in density of the depth
map and an increase in the latency between the input and the results. The latter is due to the
additional sliding window that is introduced by the geometric-consistency-based filtering. In
addition to the bundle of input images, for which only one set of estimates is produced, the
geometric filter further requires two or more depth maps for processing. Furthermore, the
geometric filter is computationally more complex than the DOG filter. Again, whether to
use the DOG or the geometric filter depends on the application. If the presented approach is,
for example, used for the task of online 3D reconstruction, meaning that a subsequent depth
map fusion step is employed (Hermann et al. 2021b), the geometric-consistency-based filtering
is typically done in the fusion of depth maps and can thus be omitted. The DOG filter, on the
other hand, is very efficient and does not introduce additional latency. However, as already
mentioned in the experiments, the DOG filter might also remove potentially good estimates,
as it is only executed based on the data provided by the input image. Nonetheless, especially
when working with input data that contains a lot of homogeneous areas with little to no
texture, e.g. a clear or cloudy sky in case of extreme oblique viewpoints, the DOG filter is
of great benefit. (Ruf et al. 2021b, Sec. 4.4.)


Lastly, as a third output, FaSS-MVS computes a confidence map containing pixel-wise
confidence scores corresponding to the depth estimates. In the scope of this work, these
confidence measures are used to perform a comparison between the different SGM extensions
based on a ROC analysis (cf. Figure 4.8). As already pointed out, the fact that some of the
curves are not monotonically increasing suggests that the confidence values do not represent
the certainty of the estimates appropriately. One reason for this might lie in the normal map.
Since the confidence values are calculated based on the surface orientation stored in the
normal map, errors in the normal map inevitably lead to unreliable confidence estimates.
However, the qualitative excerpts in Figure 4.6, Figure 4.7 and Figure 4.11 contradict this
explanation. For example, solely the fact that the scene in Row 6 of Figure 4.11 mostly
consists of fronto-parallel structures leads to a confidence map with high certainty values,
while the confidence map in Row 2 of Figure 4.11 renders the estimation of the building roof,
which qualitatively appears very accurate, as fully uncertain. Similar observations can be
made in the confidence maps depicted in Figure 4.6. Only because the ground plane is greatly
slanted with respect to the image plane, the confidence of the corresponding estimates is
rendered very low, even though qualitatively they do not appear less accurate than the
estimates on the building facade. The most likely reason is that the modeling of a confidence
score based on the surface orientation alone is not very expressive. An incorporation of
additional heuristics that are based on internal characteristics of the algorithm, as done in
previous work (Ruf et al. 2019), might improve the certainty estimation, but this still
requires a cumbersome empirical study of the hyper-parameters. In recent years, however, the
performance of learning-based approaches for the task of confidence estimation (Poggi et al.
2020, Heinrich and Mehltretter 2021) has greatly increased. They are often agnostic to the
internals of the algorithm and can be trained on any data for which both estimated and
reference depth or disparity maps are available. (Ruf et al. 2021b, Sec. 4.4.)
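
The reasoning behind the monotonicity argument can be sketched with a simplified ROC
computation. It assumes one common protocol: pixels are sorted by descending confidence, and
the share of erroneous estimates is evaluated for a growing fraction of the most confident
pixels. Whether this matches the protocol behind Figure 4.8 in every detail is an assumption;
the function names, the threshold and the toy values are hypothetical.

```python
def roc_curve(errors, confidences, bad_threshold, steps=4):
    """Share of bad estimates among the most confident pixels.

    Pixels are sorted by descending confidence; the error rate is evaluated
    for `steps` growing fractions of the sorted pixels.
    """
    order = sorted(range(len(errors)), key=lambda i: -confidences[i])
    curve = []
    for s in range(1, steps + 1):
        k = max(1, len(order) * s // steps)
        bad = sum(errors[i] > bad_threshold for i in order[:k])
        curve.append(bad / k)
    return curve


def is_monotonic(curve):
    """True if the error rate never drops when less confident pixels are added."""
    return all(a <= b for a, b in zip(curve, curve[1:]))


errors = [1, 1, 2, 9, 1, 8, 2, 9]   # absolute depth errors (in mm)
meaningful = [0.9, 0.8, 0.7, 0.1, 0.95, 0.2, 0.6, 0.05]   # low error, high confidence
misleading = [0.1, 0.2, 0.3, 0.9, 0.4, 0.95, 0.5, 0.6]    # high error, high confidence

print(roc_curve(errors, meaningful, bad_threshold=5), is_monotonic(roc_curve(errors, meaningful, 5)))
print(roc_curve(errors, misleading, bad_threshold=5), is_monotonic(roc_curve(errors, misleading, 5)))
```

With meaningful confidence values, the error rate can only grow as less confident pixels are
added. A curve that drops indicates that erroneous estimates were ranked as highly confident.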


4.6 Conclusion 


In conclusion, with FaSS-MVS an approach is presented for multi-view stereo (MVS) from
UAV-borne imagery that facilitates fast, dense and incremental 3D mapping. This
approach consists of a hierarchical processing scheme, in which dense depth maps as well as 
corresponding normal and confidence maps are estimated. For the depth map computation, 
dense multi-image matching, by utilizing the plane-sweep algorithm, is used to produce pixel- 
wise depth hypotheses. From these hypotheses, a dense depth map is extracted by adopting 
the optimization scheme of the widely used Semi-Global Matching (SGM) algorithm. Here, 
the SGM algorithm is not only adapted to work with the multi-image matching of the plane- 
sweep algorithm, but also extended in order to reduce the fronto-parallel bias and, in turn, also
account for slanted surface structures, by introducing two additional regularization schemes.
The subsequent normal and confidence map estimation is done separately on the results of the
depth estimation. In a final filtering step, geometric consistency over multiple depth maps 
is enforced, which greatly increases the overall accuracy of the resulting depth maps. (Ruf 
et al. 2021b, Sec. 5.) 


The performance of FaSS-MVS is quantitatively evaluated on two public datasets containing
image data of model-scaled scenes, captured from an aerial perspective and providing
an accurate ground truth. The experiments show that, for the best configuration, the
estimated depth maps have a mean absolute L1 error of only 8.5 mm, that is 1 % with respect to
the maximum depth of the reconstructed scene. In comparison, the geometric depth maps
from COLMAP, a widely used open-source toolbox for offline MVS, achieve a mean absolute
error of 3.7 mm. Thus, even though FaSS-MVS does not have all image data of the input
sequence available at the time of reconstruction and is subject to run-time constraints
in order to ensure fast and online processing, its quantitative results are not too far off
those of state-of-the-art offline approaches. While the quantitative results do not show a
significant improvement by the presented SGM extensions to account for slanted surface
structures, a qualitative comparison reveals their ability to account for non-fronto-parallel
surfaces. Thus, in case of oblique aerial imagery, which contains a lot of slanted surfaces,
the presented SGM extension that penalizes deviations from the gradient of the minimum cost
path, i.e. SGM^Π-pg, is the best choice, despite its higher computational complexity.
Concluding experiments on real-world and use-case-specific datasets have shown that, in terms
of run-time, the presented approach is well-suited for online processing by achieving a
processing rate of 1-2 Hz, meaning that it keeps up with the monocular input stream and allows
for incremental 3D mapping while the input data is being received. A fast 3D mapping, in turn,
can facilitate other important applications or tasks, such as the fast assessment of
inaccessible areas by emergency forces, e.g. after a flood or earthquake, in order to
accomplish disaster relief or SAR missions. (Ruf et al. 2021b, Sec. 5.)


Finally, there are also some aspects to be considered in future work. Even though FaSS-MVS
supports different plane orientations in the plane-sweep multi-image matching, so far
each estimation is done with only one orientation. In the future, the approach should be
extended to use multiple plane orientations within the computation of a single depth map. This
would allow reconstructing large planar surfaces, such as the ground plane, more smoothly,
while maintaining a higher accuracy in other regions by using fronto-parallel sampling.
Furthermore, although the processing rate is considered sufficient, a further decrease in
run-time and a more efficient use of GPU resources would free up capacity for other
concurrent tasks, such as the fusion of depth maps or the generation of orthographic photos.
Thus, further optimization in terms of run-time and utilization of processing resources
is an ongoing task. Moreover, due to the ongoing development and fast advancement of
deep-learning-based approaches for the task of MVS, it is to be investigated whether
individual steps or even the whole approach can be substituted by an appropriate
learning-based approach, while maintaining the reliability required for the use in critical
applications. Lastly, in their work, Nex and Rinaudo (2011) have shown that the complementary
use of LiDAR and image-based techniques for photogrammetric tasks has great potential. In
addition, with the improvements of LiDAR sensors and the possibility of equipping more and
more COTS UAVs with such sensors, like the Zenmuse L1¹, their use to facilitate fast and
incremental 3D mapping is inevitably to be considered in future work. (Ruf et al. 2021b,
Sec. 5.)


As part of the framework that is proposed in this work (cf. Section 1.2), FaSS-MVS aims at
providing depth estimates in order to allow rapid 3D mapping from monocular video data
captured by COTS UAVs, which in turn aids first responders in the assessment of a disaster
site. The embedding of FaSS-MVS into a processing pipeline for rapid 3D mapping is discussed
in Section 6.2. Moreover, the estimated depth data allows performing 3D change detection for
damage assessment, which is discussed in Section 6.3.


A major weakness of monocular MVS approaches, and in turn also of FaSS-MVS, is the
inability to handle dynamic objects, such as pedestrians or cars, that are moving around in the
scene during data acquisition. A possible remedy is to use approaches for two-view depth
estimation from stereo cameras. However, due to the confined baseline of stereo cameras that
can be fitted to COTS rotor-based UAVs, the depth range is limited. Another possibility to
handle dynamic objects is the prediction of depth estimates from a single image by
learning-based approaches. Such an approach for so-called single-view depth estimation is
presented and discussed in the following chapter.


¹ https://www.dji.com/de/zenmuse-l1


In the previous chapters, the process of dense depth estimation is formulated by dense image
matching (DIM), meaning that a dense correspondence field between two or more input images,
which depict the scene from different vantage points, is to be found. From this, depth
estimates can be deduced if the relative transformations between the images are known. The
process of depth estimation by means of DIM is accompanied by the assumption that all
images depict the same scene. In the case of monocular multi-view stereo (MVS) (cf.
Chapter 4), this means that the scene geometry is not allowed to change between the individual
images, which makes scenes containing dynamic objects, e.g. an urban area with moving
pedestrians or cars, cumbersome to reconstruct. When using a two-view stereo setup (cf.
Chapter 2), on the other hand, the consistency of the depicted scene in the two images is
enforced by a time-synchronized image acquisition, which also allows estimating the depth of
dynamic objects. However, in this case, the maximum depth that can be estimated is tightly
coupled to the baseline between the cameras, which is relatively limited if the system is to
be mounted to COTS UAVs (approximately 11.5 cm on the DJI Matrice 210v2 RTK), making such
systems only feasible for close-range photogrammetry and image exploitation.
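
The coupling between baseline and usable depth range can be illustrated with the standard
relation for a rectified stereo pair, z = f·b/d, with focal length f in pixels, baseline b and
disparity d. In the following sketch, only the baseline of 11.5 cm is taken from the text; the
focal length of 500 pixels is a hypothetical value chosen for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: z = f * b / d."""
    return focal_px * baseline_m / disparity_px


def depth_step(focal_px, baseline_m, depth_m, disparity_step_px=1.0):
    """Depth change caused by one disparity step at the given depth."""
    disparity = focal_px * baseline_m / depth_m
    if disparity <= disparity_step_px:
        return float("inf")   # beyond this depth, disparities are no longer separable
    return focal_px * baseline_m / (disparity - disparity_step_px) - depth_m


FOCAL_PX = 500.0      # hypothetical focal length in pixels
BASELINE_M = 0.115    # stereo baseline stated for the DJI Matrice 210v2 RTK

for z in (2.0, 5.0, 10.0, 20.0):
    step = depth_step(FOCAL_PX, BASELINE_M, z)
    print(f"depth {z:5.1f} m: one pixel of disparity corresponds to {step:.2f} m")
```

Because the depth step grows roughly quadratically with the distance, a short baseline
quickly leads to depth steps of several meters, which restricts such a system to close-range
applications.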


Motivated by the capability of humans to guess relative depth from a single image
of a known scene, increasing effort has lately been put into attempts to learn how to estimate
depth maps from single input images by means of deep learning (Eigen et al. 2014, Li et al.
2015, Godard et al. 2017). The idea is that, similar to the empirical knowledge of humans,
deep convolutional neural networks (CNNs) are able to learn discriminative image cues, from
which relative depth information can be inferred. This should hold especially if the scene of
the new image is in some way known to the CNN, i.e. if it is from the same dataset or at least
similar to the scenes covered by the training data. In the literature, this process is often
referred to as monocular depth estimation. However, in order not to overload the term
“monocular”, since in this work it is already used within the context of performing MVS from
images of a single moving camera, this process will be referred to as single-view depth
estimation (SDE).


Apart from the already mentioned advantage of allowing the reconstruction of dynamic scenes
and possibly larger depth ranges compared to two-view stereo methods, there are a number of
other advantages to SDE approaches. For one, the depth estimation by means of SFM and
MVS is unstable if the camera has a large focal length, and thus a narrow field-of-view, or if
the camera is moving in the direction of the optical axis. Moreover, unlike depth estimation
from single images, approaches that rely on DIM typically suffer from occlusions, especially
in case of two-view stereo. Lastly, a major advantage of SDE approaches over those relying on
monocular MVS is their potentially higher processing speed. While MVS approaches require
at least two images that are captured with an appropriate baseline, SDE approaches only
require a single image to estimate the scene depth and thus are able to operate at a higher
frequency.


Early work on SDE (Eigen et al. 2014, Li et al. 2015, Laina et al. 2016) relies on a supervised 
training, meaning that a reference depth map is required to evaluate the loss function and, 
in turn, to train the CNN. These studies show that state-of-the-art approaches for SDE can
outperform conventional approaches if appropriate training data is available. The dependence 
on such data is at the same time, however, one of the major drawbacks of SDE approaches. 
This is due to the fact that the acquisition of datasets, which provide a large set of accurate 
reference data that is appropriate to perform supervised training of CNNs for the task of 
single-view depth estimation, is cumbersome and costly. Thus, only a few datasets exist that 
are appropriate for the desired task, e.g. those presented by Silberman et al. (2012), Menze 
and Geiger (2015) or Schöps et al. (2017), all of which are made up of terrestrial imagery and 
thus are not suited to learn to estimate depth from a single aerial image. 


In order to overcome this limitation, recent studies have shown that CNNs can also be trained
for the task of SDE by self-supervision. Self-supervision means that, in the training process,
such approaches only require images from a stereo camera or from a video captured by a
single moving camera, similar to the input data required by conventional stereo and
multi-view stereo approaches. This can be accomplished by posing the task of depth estimation
during training as a view synthesis and image reconstruction problem. In this, one or more
neighboring images are transformed into the view of a reference camera, based on a predicted
depth map, and matched with the corresponding reference image. The assumption is that the
CNN has learned to predict the depth correctly if the synthesized image and the reference
image coincide, i.e. if the matching costs are at their minimum. Thus, such approaches do
not require any special training data, because they supervise their training by themselves.
Such approaches are referred to as self-supervised single-view depth estimation (SSDE) or
self-supervised monocular depth estimation (SMDE) approaches. The flexibility regarding
training data allows using SSDE in any domain without the need for costly acquired training
data, possibly eliminating a major impediment of deep-learning-based approaches for the
practical use in the scope of depth estimation.
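
The view-synthesis formulation can be sketched on a single image row. The sketch assumes a
rectified stereo pair, so that the predicted depth acts as a horizontal per-pixel disparity,
and a plain L1 photometric loss; for monocular video, the relative camera pose would be needed
in addition, and practical systems use differentiable sampling inside the training framework.
The function name and the toy values are hypothetical.

```python
def photometric_l1(reference, neighbor, disparity):
    """Mean L1 difference between a reference row and a neighbor row warped into it.

    Each reference pixel x is synthesized from the neighbor at x - disparity[x]
    using linear interpolation. Pixels whose source lies outside the neighbor
    image are skipped.
    """
    total, count = 0.0, 0
    last = len(neighbor) - 1
    for x, d in enumerate(disparity):
        src = x - d
        if src < 0 or src > last:
            continue                      # no valid source pixel available
        x0 = int(src)
        x1 = min(x0 + 1, last)
        w = src - x0
        synthesized = (1.0 - w) * neighbor[x0] + w * neighbor[x1]
        total += abs(reference[x] - synthesized)
        count += 1
    return total / count if count else 0.0


reference = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
neighbor = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0]   # same content, shifted by 2 pixels

for d in (0.0, 1.0, 2.0):
    print(d, photometric_l1(reference, neighbor, [d] * len(reference)))
```

The loss is smallest for the disparity that brings the synthesized row into agreement with
the reference row, which is the signal that drives the training without any reference depth.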
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5.1 Contributions and Outline 


In this chapter, it is studied how modern SSDE or SMDE approaches, which have achieved
great results in the context of autonomous driving and driver assistance (Godard et al. 2017, 
Wang et al. 2018, Mahjourian et al. 2018, Godard et al. 2019), can be adopted for the task 
of learning single-view depth estimation from aerial imagery captured by a COTS UAV. To 
this aim, 


• an approach is presented that demonstrates the feasibility of training a CNN to estimate
a depth map from a single aerial image, solely based on images from a single moving
camera,


• the depth maps estimated by the proposed SSDE approach are compared and
evaluated against the results from conventional approaches based on DIM and MVS,
and


• the ability for generalization of such an SSDE approach, i.e. the performance achieved
on images from unknown scenes, is investigated and the possible practical use for the
task of depth estimation is discussed.


The contributions presented in this chapter have partly been published in:


• Hermann, M.; Ruf, B.; Weinmann, M., and Hinz, S. (2020): “Self-supervised learning
for monocular depth estimation from aerial imagery”. In: ISPRS Annals of the
Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020, pp. 357-364.
Peer-reviewed on the basis of the full paper. Cited as (Hermann et al. 2020).


This chapter is structured as follows: In Section 5.2, related work on deep-learning-based 
approaches for the task of depth estimation from a single image is discussed, both by super- 
vised and self-supervised training. This is followed by a detailed description in Section 5.3 
focusing on the self-supervision process that is employed in this work. Details on the infer- 
ence of depth maps from a single image by an appropriately trained CNN as well as details 
on the implementation are given in Section 5.4 and Section 5.5, respectively. The experi- 
mental results are presented and discussed in Section 5.6 and Section 5.7, respectively, before 
providing concluding remarks in Section 5.8. 


5.2 Related Work 


In the following sections, the related work on deep-learning-based approaches for the task
of SDE is presented and briefly discussed. First, Section 5.2.1 introduces related work that is
based on supervised learning, i.e. approaches that rely on a reference depth map to train a
CNN for the task of single-view depth estimation. This is followed by the presentation of
approaches that rely on self-supervised learning in Section 5.2.2, before providing an overview
of related work that relies on CNNs that are trained in a self-supervised manner for the task
of estimating depth maps from a single aerial image in Section 5.2.3.


5.2.1 Supervised Learning of Single-View Depth Estimation 


As already stated, with the advancements achieved by CNNs, increasing effort is put into 
trying to train a deep-learning-based model to predict a depth map from a single input im- 
age. This is motivated by the ability of humans to deduce the scene structure from a single 
image based on their empirical knowledge. Early work started off by training deep CNNs 
in a supervised manner, requiring reference data to generate a training signal. Eigen et al. 
(2014) present an approach that relies on a cascade of two CNNs, where the first network 
focuses on the estimation of a coarse depth map based on global prediction, which is refined 
by the second network to produce the final result. In further work, Li et al. (2015) propose 
to use a CNN to regress deep features from the input image, which are then refined by a 
hierarchical conditional random field (CRF) to produce the final output. In their work, Laina 
et al. (2016), on the other hand, rely on a single CNN, with an encoder-decoder structure 
based on the ResNet50 (He et al. 2016), that is trained end-to-end without the need of post- 
processing by conventional methods. A great impediment of such approaches, however, is 
their dependence on training data containing reference depth maps, of which only a small 
number exist, e.g. those presented by Silberman et al. (2012), Menze and Geiger (2015), and 
Schöps et al. (2017), all of which consist of ground-based imagery. One method to overcome 
this limitation is to use synthetically generated training data (Johnson-Roberson et al. 2017, 
Mayer et al. 2018). However, the process of creating large amounts of training data, which is 
realistic enough to allow generalization of the resulting model to real-world scenes, is cum- 
bersome, time-consuming and costly. Moreover, the generalization of a trained model across 
different domains, be it between synthetic and real-world scenes or between terrestrial and 
aerial imagery, is still very challenging and does not always achieve satisfying results. 


5.2.2 Self-Supervised Learning of Single-View Depth Estimation 


Self-supervised techniques provide another possibility to overcome the limitation of requir- 
ing appropriate training data. Such approaches formulate the training task as a novel view 
synthesis and image reconstruction problem and thus supervise themselves in the training 
for the task of depth prediction. Early works (Flynn et al. 2016, Xie et al. 2016) train a model 
to synthesize images from new viewpoints by estimating the scene depth and transforming 
the available imagery into the new vantage points. In this, the prediction of depth is just an
auxiliary task in order to correctly sample new images, which requires an understanding of
the scene geometry. However, this technique can also be used to facilitate the learning for
the task of depth map prediction by a subsequent comparison between the newly synthesized
image and the actual image captured by a camera from the corresponding viewpoint. Khot
et al. (2019), for example, employ such a training technique to learn how to predict a depth
map from multiple images. But it can also be used to train models for the task of single-view
depth estimation, as shown, for example, by Godard et al. (2017) and Godard et al. (2019).


A key requirement for sampling synthetic images from a new viewpoint is knowledge of the
intrinsic camera parameters as well as the extrinsic transformation between the individual
input images. This
makes approaches which are based on images from a fixed stereo camera setup, like the one 
of Godard et al. (2017), particularly suitable, since the relative transformation between the 
cameras can be calibrated a priori and stays the same for all images. However, as explained
previously, depending on the use-case and the application, a stereo camera setup is
not always suitable, due to the limited baseline. Thus, other approaches (Mahjourian et al.
2018, Wang et al. 2018, Zhao et al. 2016, Godard et al. 2019) rely on the paradigm of monocular
SFM and MVS and only use images of a single moving camera during training. While this, in
turn, allows CNNs to be trained for the task of SDE directly from video data, it requires the model
to additionally learn how to predict the extrinsic orientations between the input images, in- 
creasing the number of parameters to train. Nonetheless, the independence of additional 
reference data during training makes such approaches especially fit for the learning-based 
single-view depth estimation from aerial imagery captured by COTS UAVs, since the acqui- 
sition of adequate training material is much easier. 


5.2.3 Self-Supervised Learning of Single-View Depth Estimation 
from Aerial Imagery 


The majority of approaches for the task of SSDE arise from the context of autonomous driving, 
which is, amongst other reasons, due to the prominent availability of appropriate datasets, 
such as the KITTI (Menze and Geiger 2015) or Cityscapes (Cordts et al. 2016) benchmarks. 
Due to higher degrees of freedom, i.e. varying flight altitude, camera movement and orien- 
tation as well as downwards looking perspective, aerial imagery is, however, significantly 
different compared to recordings of street scenes. In their work, Knöbelreiter et al. (2018) 
demonstrate, as one of the first, the use of self-supervised learning for stereo reconstruction 
from aerial imagery. However, they use a rectified stereo image pair as input data and refine 
the estimated depth map by a subsequent hybrid method, relying on a deep CNN and a CRF 
(Knöbelreiter et al. 2017). 


The approach for SSDE from aerial imagery presented in this work, in contrast, aims to
directly use and evaluate the predicted depth maps without any post-filtering or
post-optimization. Moreover, similar to the latest approaches that derive from the context of
autonomous driving (Mahjourian et al. 2018, Wang et al. 2018, Zhao et al. 2016, Godard et al.
2019), the presented approach only uses image data from a single moving camera mounted to
a COTS UAV during training. The approach presented by Madhuanand et al. (2021) also relies
on video data captured by a camera mounted to a moving COTS UAV for the self-supervised
training of a CNN for SDE from aerial imagery. In contrast to the work presented in this
chapter, it uses a different network architecture based on a 2D CNN encoder and a 3D CNN
decoder as well as two additional loss functions. This slightly improves the quality of
the resulting depth maps compared to the approach that is presented in this chapter and
that was published by Hermann et al. (2020).


5.3 Methodology of Learning Single-View Depth 
Estimation by Self-Supervision 


In the following section, the procedure of self-supervised learning for the task of SDE
implemented in the scope of this work, henceforth denoted as SSDE, is described. First, an
overview of the complete procedure is given, before the details of each individual step are explained.


Similar to the approach described in Chapter 4 for Fast Multi-View Stereo with Surface-Aware 
Semi-Global Matching (FaSS-MVS), the training of the SSDE approach requires a set of image 
bundles consisting of 3 or more images from an input sequence that depict a static scene 
from different vantage points. Just as for approaches that perform depth estimation based on 
DIM, it is crucial that there is enough parallax, i.e. perspective difference, between the input 
images in order to allow depth estimation, while ensuring only a small amount of occluded
areas, which hinder the process of image matching. While the compliance with this criterion
can be enforced by an intelligent pre-selection of the image data based on image content and 
optical flow methods (Hermann et al. 2021a), it is empirically determined in the scope of this 
work. Again, the center image of the input bundle is referred to as reference image $I_{\mathrm{ref}}$, for
which a depth map $D$ is to be predicted, while the rest of the images $I_k$ are grouped into
a left ($k < \mathrm{ref}$) and a right ($k > \mathrm{ref}$) subset with respect to $I_{\mathrm{ref}}$. While it is assumed that
the intrinsic camera matrix $K$ is given, the relative extrinsic transformations $E_{\mathrm{ref}\to k} = [R \;\, T]$
between the images are not required.


Figure 5.1: Illustration of the self-supervised training process for the task of single-view depth estimation. Step (1)
predicts the depth map $D$ from $I_{\mathrm{ref}}$. Step (2) predicts the relative transformations $E_{\mathrm{ref}\to k}$ (with $k = \pm 1$)
between the reference view, corresponding to $I_{\mathrm{ref}}$, and its neighboring views, indicated with $I_{-1}$ and $I_{+1}$,
which represent a preceding and following view of the image sequence, respectively. Step (3) samples
synthetic images given the neighboring images as well as the predicted depth $D$ and the predicted
relative transformations $E_{\mathrm{ref}\to k}$. Step (4) matches the synthetic views against $I_{\mathrm{ref}}$ and calculates the
reconstruction error. (Hermann et al. 2020, Fig. 2)


Every single training iteration is subdivided into four consecutive steps as illustrated by
Figure 5.1. Step (1) performs single-view depth estimation to predict the depth map $D$ from
$I_{\mathrm{ref}}$. In step (2), the relative transformations $E_{\mathrm{ref}\to k}$, consisting of rotation $R$ and translation $T$,
between the reference view (corresponding to $I_{\mathrm{ref}}$) and its neighboring views are predicted.
Given the depth map $D$ and the relative transformations $E_{\mathrm{ref}\to k}$, new synthetic images
corresponding to the reference view are sampled from the neighboring images in step (3). Finally,
the newly sampled reference images are matched against the actual reference image $I_{\mathrm{ref}}$ in
step (4), calculating the reconstruction error and in turn the training loss, which is then
backpropagated through the CNN. This procedure is executed in each training iteration and thus
repeated thousands of times within the complete training of the CNN for the task of
single-view depth estimation. At the beginning of the training, the predictions of $D$ and $E_{\mathrm{ref}\to k}$
will be of very low quality, resulting in a high matching cost and, in turn, a high
training loss. With ongoing training, the quality of the predictions improves
as the network learns how to minimize the training loss by predicting the
corresponding data correctly. When the training loss, and with it the matching cost between $I_{\mathrm{ref}}$
and the synthetic image sampled from the neighboring views, has reached its minimum, it is
assumed that the CNN has learned to predict the data correctly, as a new view can only be
convincingly sampled if the scene geometry and extrinsic camera positions are correct. In
the following sections, each step of the outlined procedure is described in detail.


5.3.1 Single-View Depth Estimation 
Figure 5.2: Illustration of the structure of the U-Net employed for single-image depth estimation. It comprises an encoder and a decoder. As encoder the ResNet18
architecture is used, while for the decoder a simple 2 × 2 nearest neighbor upsampling is employed. As activation functions in the individual layers of
the decoder, exponential linear units (ELUs) are employed. To compute the depth map, a final Sigmoid activation is used.


As a network topology for the CNN responsible for estimating a depth map $D$ from a single
input image, the U-Net architecture introduced by Ronneberger et al. (2015) is adopted and
illustrated by Figure 5.2. The U-Net is made up of an encoder, which transforms the input
image into a deep feature representation, from which the output is then generated by the
decoder. As encoder, the ResNet18 architecture (He et al. 2016) is used, since it achieves a
higher accuracy than other architectures, such as the VGG-Net16 (Simonyan and Zisserman
2015), while at the same time having fewer trainable parameters, as discussed in Section 5.6.4.1.
To further reduce the number of trainable parameters, the approach of Godard et al. (2017)
is employed, which relies on simple nearest neighbor upsampling instead of a transposed
convolution. So-called skip connections between individual layers of the encoder and decoder
ensure that features that have been learned in early stages of the CNN also have an impact
on the final output. In this, the features of the encoder are concatenated with the
corresponding features of the decoder. As activation functions in the individual layers of the decoder,
exponential linear units (ELUs) are employed. To compute the depth map, a final Sigmoid
activation is used.
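

The decoder operations named above can be illustrated with a minimal NumPy sketch. It only shows nearest neighbor upsampling, the ELU and Sigmoid activations and the concatenation of skip features; the convolutions of the actual decoder are omitted, and all function names and the toy tensor sizes are assumptions made for illustration, not part of the released implementation.

```python
import numpy as np


def upsample_nearest_2x(features: np.ndarray) -> np.ndarray:
    """Nearest neighbor upsampling by a factor of 2 for an (H, W, C) array."""
    return features.repeat(2, axis=0).repeat(2, axis=1)


def elu(x: np.ndarray) -> np.ndarray:
    """Exponential linear unit with alpha = 1."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid activation, mapping values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def decoder_stage(decoder_features: np.ndarray, skip_features: np.ndarray) -> np.ndarray:
    """Activate and upsample decoder features, then append the encoder skip features.

    The convolution layers of a real decoder stage are left out (illustrative only).
    """
    upsampled = upsample_nearest_2x(elu(decoder_features))
    return np.concatenate([upsampled, skip_features], axis=-1)


# Toy example: 2 x 2 decoder features with 2 channels and 4 x 4 skip features with 3 channels.
decoder_features = np.arange(8, dtype=np.float64).reshape(2, 2, 2) - 4.0
skip_features = np.zeros((4, 4, 3))
fused = decoder_stage(decoder_features, skip_features)   # shape (4, 4, 5)
output_map = sigmoid(fused.mean(axis=-1))                # stand-in for the final Sigmoid output
```

Because the upsampling has no trainable weights, each decoder stage only needs to learn its convolutions, which is the parameter saving referred to in the text.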


5.3.2 Relative Pose Estimation 


Figure 5.3: Illustration of the network architecture to estimate the relative transformation between images from
the reference view and its neighbors. The encoder is shared between the networks responsible for the
depth estimation and relative pose estimation, illustrated by the same color and the white arrows. The
deep feature maps from the encoders are then concatenated before being passed through three fully
connected layers and a final Sigmoid activation layer. (Hermann et al. 2020, Fig. 3)


In the second step of the self-supervised training, the relative extrinsic transformations
$E_{\mathrm{ref}\to k}$, i.e. relative poses, between the view of the reference camera and the neighboring
views, and in turn between the corresponding images, are estimated. While this
transformation estimation could also be done using image features, the presented approach
relies on deep learning to estimate the relative poses, allowing for full end-to-end training,
similar to that presented by Zhou et al. (2017). In this, each additional input image is also
transformed into a deep feature representation by an encoder CNN, just as the reference
image in the scope of predicting the depth map (cf. Section 5.3.1). In order to keep the resulting
model at a small size, the weights of the depth estimation encoder and the relative pose
estimation encoders are shared (cf. Figure 5.3). This means that for the computation of the deep
feature representation, the same layers of the U-Net encoder are used for both depth and pose
estimation. This also has a positive effect on the time it takes to train the model, since fewer
parameters need to be trained. The sharing of the encoders is also motivated by the fact that
the depth and pose estimation are both tightly coupled to the scene geometry, which should
ideally be correctly encoded in the deep feature maps. As illustrated by Figure 5.3, the deep
features of all input images, calculated by the corresponding encoders, are concatenated and
then passed through a sequence of three fully connected layers, followed by a global average
pooling in the last layer of the network. The output is a vector containing the rotational and
translational parameters of $R$ and $T$, respectively, both relative to the reference view.
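

How such an output vector can be turned into a usable transformation is sketched below. The text does not state how the rotation is parameterized, so this example assumes a six-element vector with an axis-angle rotation followed by a translation, which is a common choice in related work; the function name `pose_vector_to_matrix` is hypothetical.

```python
import numpy as np


def pose_vector_to_matrix(pose: np.ndarray) -> np.ndarray:
    """Convert [rx, ry, rz, tx, ty, tz] into a 4 x 4 rigid transformation.

    Assumption: the first three entries are an axis-angle rotation vector,
    which is converted to a rotation matrix with the Rodrigues formula.
    """
    rotation_vector, translation = pose[:3], pose[3:]
    angle = np.linalg.norm(rotation_vector)
    if angle < 1e-12:
        rotation = np.eye(3)
    else:
        axis = rotation_vector / angle
        skew = np.array([[0.0, -axis[2], axis[1]],
                         [axis[2], 0.0, -axis[0]],
                         [-axis[1], axis[0], 0.0]])
        rotation = np.eye(3) + np.sin(angle) * skew + (1.0 - np.cos(angle)) * (skew @ skew)
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform


# Example: rotation of 90 degrees about the optical axis plus a translation.
pose_example = np.array([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
E_example = pose_vector_to_matrix(pose_example)
```

A minimal parameterization such as this keeps the predicted rotation a valid rotation matrix by construction, which would not be guaranteed if the nine matrix entries were regressed directly.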


5.3.3 Image Projection 


In order for the approach to compute a training loss and, in turn, for the CNN to learn the
parameters for predicting the depth map $D$ and the relative transformations $E_{\mathrm{ref}\to k}$, the CNN
needs to know how to project the images from the neighboring views $I_k$ (in the following
denoted as matching images) into the view of the reference image $I_{\mathrm{ref}}$. To facilitate complex
processes like image projection and image sampling, Jaderberg et al. (2015) proposed the
concept of spatial transformer networks (STNs). Instead of actually training a network to
learn a specific algorithm, Jaderberg et al. (2015) propose to use STNs to learn the parameters
of a predefined algorithm. For example, in the case of image rotation, the STN does not 
have to learn how to actually rotate an image but only by how much it is to be rotated. An 
STN is composed of a Localization Net, a Grid Generator and a Sampler. The Localization 
Net learns the parameters of a predefined transformation. Within the Grid Generator, the 
target coordinates of a regular grid are then transformed into destination coordinates using 
the previously learned parameters. Lastly, given these destination coordinates, the Sampler 
samples the new image. 


Given the intrinsic calibration matrix $K$, the projection of a matching image $I_k$ into the view
of the reference image $I_{\mathrm{ref}}$ is formulated as:


$$ I_{k\to\mathrm{ref}} = I_k \left[ P(D, E_{\mathrm{ref}\to k}, K) \right] . \tag{5.1} $$


Here, $P$ denotes the predefined projective transformation, for which the parameters in the
form of the predicted depth map $D$ and the predicted relative extrinsic transformations $E_{\mathrm{ref}\to k}$
are to be learned. As part of the actual sampling $[\cdot]$, which transforms $I_k$ to create the new
image, a bilinear interpolation is employed. As illustrated by Figure 5.2, four individual depth
maps with four different resolutions are computed after each upsampling layer of the
decoder. The use of multi-scale estimation stabilizes the training. In this, instead of
downsampling $I_{\mathrm{ref}}$ to the resolution of $D$, the approach proposed by Godard et al. (2019) to
upsample $D$ to the resolution of $I_{\mathrm{ref}}$ is used.
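

The projection and sampling of Equation (5.1) can be sketched in NumPy as follows. This is a simplified, non-differentiable illustration for a single-channel image at a single scale. It assumes that the transformation maps points from the reference camera frame into the frame of view $k$ and that coordinates outside the image are clamped to the border; the function names and the toy calibration values are likewise assumptions.

```python
import numpy as np


def project_to_matching_view(depth, E_ref_to_k, K):
    """For every reference pixel, compute its (u, v) position in the matching image."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])          # 3 x N homogeneous pixels
    points_ref = (np.linalg.inv(K) @ pixels) * depth.ravel()           # back-projection with D
    points_k = E_ref_to_k[:3, :3] @ points_ref + E_ref_to_k[:3, 3:4]   # rigid transformation
    projected = K @ points_k                                           # projection into view k
    uv = projected[:2] / np.maximum(projected[2:3], 1e-9)
    return uv[0].reshape(h, w), uv[1].reshape(h, w)


def bilinear_sample(image, u, v):
    """Sample an (H, W) image at floating point coordinates (clamped to the border)."""
    h, w = image.shape
    u = np.clip(u, 0, w - 1)
    v = np.clip(v, 0, h - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, w - 1), np.minimum(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * image[v0, u0] + du * image[v0, u1]
    bottom = (1 - du) * image[v1, u0] + du * image[v1, u1]
    return (1 - dv) * top + dv * bottom


def warp_to_reference(image_k, depth, E_ref_to_k, K):
    """Synthesize the reference view from the matching image, cf. Equation (5.1)."""
    u, v = project_to_matching_view(depth, E_ref_to_k, K)
    return bilinear_sample(image_k, u, v)


# Toy example: fronto-parallel scene at depth 10 and a sideways camera shift of 0.1 units.
K_example = np.array([[100.0, 0.0, 4.0], [0.0, 100.0, 3.0], [0.0, 0.0, 1.0]])
image_k = np.random.default_rng(0).random((6, 8))
depth_example = np.full((6, 8), 10.0)
E_shift = np.eye(4)
E_shift[0, 3] = 0.1   # focal length 100 * 0.1 / depth 10 = shift of one pixel
warped = warp_to_reference(image_k, depth_example, E_shift, K_example)
```

In the toy example, the synthesized image equals the matching image shifted by one pixel, which is the expected parallax for the chosen depth and baseline. In a training framework, the same operations are expressed with differentiable tensor operations so that the reconstruction error can be backpropagated to the depth and pose predictions.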


5.3.4 Image Matching and Loss Calculation 


Since this approach is not based on supervised learning, it does not rely on a reference depth
map with respect to which a loss function is formulated. In contrast, as previously described,
the network is trained by a loss function $L$, which evaluates the projection of a matching
image $I_k$ into the view of the reference image $I_{\mathrm{ref}}$ given the predicted depth map $D$ and
relative transformation $E_{\mathrm{ref}\to k}$. Thus, the employed training loss does not allow a direct
reasoning about the quality of the predicted depth map based on a given reference. Nonetheless,
induced by the nature of the approach, a good image projection and sampling is only
possible if $D$ and $E_{\mathrm{ref}\to k}$ are predicted correctly. Thus, the loss function $L$ is formulated
to evaluate the quality of the image projection and sampling, similar to the cost function in
the scope of the plane-sweep multi-image matching (cf. Section 4.3.2). In particular, it is
formulated according to:


$$ L = L_{\mathrm{photo}} + \lambda \, L_{\mathrm{smooth}} , \tag{5.2} $$


with $L_{\mathrm{photo}}$ being a photometric loss modeling the quality of the image reconstruction, and
with $L_{\mathrm{smooth}}$ being a smoothness loss enforcing smooth depth maps. The weighting parameter
is empirically set to $\lambda = 0.0001$. In the following, the calculation of the photometric and the
smoothness loss is described in detail.


Photometric Loss: 


Just as in the plane-sweep multi-image matching of FaSS-MVS (cf. Section 4.3.2), the approach
of Kang et al. (2001) to account for occlusions is used, by separately calculating the loss for
the left and right subset of $I_{\mathrm{ref}}$ and taking the minimum of both for the final loss, since it can
again be assumed that areas which are occluded in one subset are visible in the other. Thus,
the photometric loss is the minimum of the left and right reconstruction error $s$ according to:


$$ L_{\mathrm{photo}} = \min\left( \sum_{k<\mathrm{ref}} s(I_{k\to\mathrm{ref}}, I_{\mathrm{ref}}) ,\; \sum_{k>\mathrm{ref}} s(I_{k\to\mathrm{ref}}, I_{\mathrm{ref}}) \right) , \quad \text{with} $$

$$ s(I_a, I_b) = \alpha \, \frac{1 - \mathrm{SSIM}(I_a, I_b)}{2} + (1 - \alpha) \, \lVert I_a - I_b \rVert_1 . \tag{5.3} $$


Here, the reconstruction error $s$ is composed of a structural similarity (SSIM) measure and an
L1-Norm, similar to the work of Godard et al. (2017), Mahjourian et al. (2018), and
Madhuanand et al. (2021). The SSIM measure is evaluated by Zhao et al. (2016) to provide visually
appealing results for the task of image reconstruction by deep CNNs and is often used in the
scope of view synthesis. As in the approach presented by Godard et al. (2017), the Gaussian
kernel within SSIM is substituted by a 3 × 3 box filter, leading to a simpler implementation and
a more efficient processing. The $c_1$ and $c_2$ parameters of SSIM are set to 0.0001 and 0.0009,
respectively. In the experiments, the weighting factor between the SSIM and the L1-Norm is
set to $\alpha = 0.15$, yielding a more stable convergence of the training.
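

A minimal NumPy sketch of the photometric loss of Equation (5.3) is given below for single-channel images with values in [0, 1]. The parameter values are taken from the text; the reflect padding of the box filter and the reduction (minimum per pixel, followed by a mean over the image) are assumptions, as are all function names.

```python
import numpy as np

C1, C2 = 0.0001, 0.0009   # SSIM constants as stated in the text
ALPHA = 0.15              # weighting between SSIM and L1 term as stated in the text


def box_filter_3x3(image):
    """Mean over a 3 x 3 neighborhood (reflect padding is an assumption)."""
    padded = np.pad(image, 1, mode="reflect")
    h, w = image.shape
    accumulated = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            accumulated += padded[dy:dy + h, dx:dx + w]
    return accumulated / 9.0


def ssim(a, b):
    """Per-pixel SSIM with a 3 x 3 box filter instead of a Gaussian kernel."""
    mu_a, mu_b = box_filter_3x3(a), box_filter_3x3(b)
    var_a = box_filter_3x3(a * a) - mu_a ** 2
    var_b = box_filter_3x3(b * b) - mu_b ** 2
    cov = box_filter_3x3(a * b) - mu_a * mu_b
    numerator = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    denominator = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return numerator / denominator


def reconstruction_error(a, b, alpha=ALPHA):
    """Per-pixel reconstruction error s(I_a, I_b) of Equation (5.3)."""
    return alpha * (1.0 - ssim(a, b)) / 2.0 + (1.0 - alpha) * np.abs(a - b)


def photometric_loss(warped_left, warped_right, reference):
    """Minimum of the left and right reconstruction error, averaged over the image."""
    left = sum(reconstruction_error(w, reference) for w in warped_left)
    right = sum(reconstruction_error(w, reference) for w in warped_right)
    return np.minimum(left, right).mean()   # per-pixel minimum is an assumption


# Toy example: the right subset is "occluded" (pure noise), the left one is perfect.
rng = np.random.default_rng(0)
reference = rng.random((8, 8))
noise = rng.random((8, 8))
loss_perfect = photometric_loss([reference], [reference], reference)
loss_occluded = photometric_loss([reference], [noise], reference)
loss_bad = photometric_loss([noise], [noise], reference)
```

The toy example illustrates the occlusion handling: as long as one of the two subsets reconstructs the reference image well, the minimum suppresses the error of the other subset.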


Smoothness Loss: 


While the smoothness loss aims to enforce a smooth depth map $D$ and helps to compensate
the inability to calculate the photometric loss in weakly-textured areas, it is also desired to
preserve depth discontinuities at object boundaries. Thus, in order to not smooth over edges
too strongly, an edge-aware smoothness loss is employed, like that presented by Wang et al.
(2018). Just as with the adjustment of the SGM penalty (cf. Section 4.3.3.2), the aim is
to adjust the loss function based on the image gradient in $I_{\mathrm{ref}}$. Thus, the smoothness loss is
formulated according to:


$$ L_{\mathrm{smooth}} = \left| \partial_u \frac{D}{\bar{D}} \right| \exp\left( - \left| \partial_u I_{\mathrm{ref}} \right| \right) + \left| \partial_v \frac{D}{\bar{D}} \right| \exp\left( - \left| \partial_v I_{\mathrm{ref}} \right| \right) . \tag{5.4} $$


In this, the color gradient in u- and v-direction of $I_{\mathrm{ref}}$ is used to weight the smoothness and,
in turn, the gradient of the depth map $D$. Moreover, to account for the problem of a degrading
depth map, the approach of depth normalization $D/\bar{D}$ presented by Wang et al. (2018) is used.
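

The edge-aware smoothness term of Equation (5.4) can be sketched as follows. The sketch assumes forward differences as gradient operator, a mean over the color channels for the image gradient, and a mean over all pixels as reduction; none of these details are specified in the text, and the function name is hypothetical.

```python
import numpy as np


def edge_aware_smoothness(depth, image):
    """Edge-aware smoothness loss for a (H, W) depth map and a (H, W, 3) image."""
    normalized = depth / (depth.mean() + 1e-7)                  # depth normalization D / mean(D)
    grad_depth_u = np.abs(normalized[:, 1:] - normalized[:, :-1])
    grad_depth_v = np.abs(normalized[1:, :] - normalized[:-1, :])
    grad_image_u = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    grad_image_v = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    # Depth gradients are down-weighted where the image itself has strong gradients.
    return ((grad_depth_u * np.exp(-grad_image_u)).mean()
            + (grad_depth_v * np.exp(-grad_image_v)).mean())


# Toy example: a depth step that either coincides with an image edge or lies in a flat region.
depth_step = np.ones((4, 4))
depth_step[:, 2:] = 2.0
flat_image = np.zeros((4, 4, 3))
edge_image = np.zeros((4, 4, 3))
edge_image[:, 2:, :] = 1.0
loss_flat = edge_aware_smoothness(depth_step, flat_image)
loss_edge = edge_aware_smoothness(depth_step, edge_image)
```

In the toy example, the depth discontinuity is penalized less when it coincides with an image edge, which is the behavior intended by the edge-aware weighting.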


5.4 Inference of Depth from a Single Aerial Image 


The presented approach aims at predicting a depth map from a single aerial image. Thus,
there is no need for an image bundle to estimate the depth map. Consequently, the
fully connected layers, used to estimate the relative transformations $E_{\mathrm{ref}\to k}$ between the
reference image and the neighboring images during training, are not required at inference
time. Only the U-Net, which predicts the depth map $D$ within the first step of the
training routine, is used during inference. This again has a positive effect on the
processing speed, as the number of parameters and, in turn, the size of the model is further reduced.
However, as discussed in Section 5.7.3, when the model is to be used and retrained on a new
scene, the whole network, including the part which is responsible for predicting the relative
transformations, is needed.


5.5 Implementation Details and Training Procedure


An implementation of the presented approach in Python, using the deep learning framework 
TensorFlow, is released as open source on GitHub*. During training, the Adam optimizer
(Kingma and Ba 2015) is used with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate is set to
0.0002. While Godard et al. (2019) use a batch size of 4 to train their approach on data of 
the KITTI benchmark, the training of the presented approach on aerial imagery does not 
converge with a batch size of 4 (cf. Section 5.6.4.2). Thus, in the scope of this work, a batch 
size of 20 is used. As input bundle, three input images are used. During training, the model 
of the presented approach has approximately 21 million parameters. However, since only the 
depth prediction network is required during the time of inference, provided that no continual 
learning is required, this can be reduced to 17 million parameters. 


During training, the actual input data is augmented with randomly cropped and scaled image 
copies, in order to increase the versatility of the training data. Moreover, for 50% of the 
input bundles, the images are horizontally flipped in order to avoid a bias in the camera
movement, as this inverts the relative transformations which are to be predicted by the pose 
estimation network. Lastly, the brightness, contrast and saturation are adjusted randomly 
with deviations of up to 20%, while hue is randomly adjusted up to 10%. 
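

Two of the augmentations described above are sketched below in NumPy, namely the horizontal flip and the brightness adjustment. The sketch assumes that the same flip decision and the same brightness factor are applied to all images of a bundle, so that the images remain photometrically and geometrically consistent; cropping, scaling, contrast, saturation and hue are omitted, and the function name is hypothetical.

```python
import numpy as np


def augment_bundle(bundle, rng, flip_probability=0.5, max_brightness_change=0.2):
    """Augment a list of (H, W, 3) float images in [0, 1] that form one input bundle."""
    # The flip decision is drawn once per bundle (assumption), mirroring the camera movement.
    if rng.random() < flip_probability:
        bundle = [image[:, ::-1] for image in bundle]
    # One brightness factor per bundle (assumption), with a deviation of up to 20 %.
    factor = rng.uniform(1.0 - max_brightness_change, 1.0 + max_brightness_change)
    return [np.clip(image * factor, 0.0, 1.0) for image in bundle]


# Toy example with a bundle of three images.
rng = np.random.default_rng(0)
bundle = [rng.random((4, 6, 3)) for _ in range(3)]
augmented = augment_bundle(bundle, rng)
flipped_only = augment_bundle(bundle, rng, flip_probability=1.0, max_brightness_change=0.0)
```

Note that, if the principal point of the camera is not centered, a horizontal flip would also require mirroring its horizontal coordinate in the intrinsic matrix; whether this is handled in the released implementation is not stated in the text.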


In the scope of this work, no pre-trained model is used and the network is trained from 
scratch. It is trained up to the point where the curve of the training loss flattens. During 
training, by validating the model after each training period, it was noted that the validation 
error does not improve after approximately one third of the total training time. Thus, the 
time used to train the model is calculated up to the training epoch, after which the lowest 
validation error is reached for the first time. For an image size of 384 x 224 pixels, which 
yields the best results while, at the same time, ensuring a stable training (cf. Section 5.6.4.3), 
the proposed network requires around 100 epochs for the full dataset of about 10,000 images 
(cf. Section 5.6.1), resulting in a training time of roughly 24 hours on an NVIDIA Titan X GPU.
The training time changes proportionally when a different image size is used. 


With the above-mentioned image resolution, the trained model achieves a frame rate of up to
20 FPS on the Titan X GPU without any optimization. This includes the loading of the image
data into GPU memory as well as the pre-processing of the images, i.e. resizing them to the 
input resolution, and writing the output onto the hard drive. 


* https://github.com/Max-Hermann/SelfSupervisedAerialDepthEstimator


5.6 Experiments


In the following sections, the quantitative and qualitative results from the conducted exper- 
iments are presented. First, an overview of the utilized evaluation datasets and the error 
measures is given in Section 5.6.1 and Section 5.6.2, respectively. Then, the quantitative re- 
sults achieved on street-level imagery are presented in Section 5.6.3.1, together with results 
from literature and ReS²tAC on the same datasets as comparison. This is followed by a
presentation of the results achieved on aerial imagery in Section 5.6.3.2, which also contains a
comparison to FaSS-MVS. Lastly, the results of an ablative study with respect to different 
aspects of the approach are presented in Section 5.6.4. All experiments, including training 
and inference, were done on an NVIDIA Titan X GPU.


5.6.1 Evaluation Datasets 


The overall performance of the presented approach for SSDE is quantitatively evaluated on 
three different datasets: on the dataset from the KITTI benchmark (Menze and Geiger 2015) 
comprised of street-level imagery, on a synthetic dataset comprised of aerial imagery with 
accurate ground truth, and on the TMB dataset (cf. Section 4.4.1) containing real-world aerial 
imagery. While the two aerial datasets are used to evaluate and demonstrate the ability of 
the approach for SSDE in a use-case-specific environment and compare its results to that 
of FaSS-MVS, the KITTI benchmark allows a comparison with respect to other approaches, 
such as those presented by Godard et al. (2019) and Zhou et al. (2017) as well as ReS²tAC. In
case of the KITTI benchmark, ground truth data is captured by a LiDAR sensor providing an 
accurate point cloud from which the ground truth depth maps are deduced. As described in 
Section 4.4.1, in case of the TMB dataset, the offline SFM and MVS toolbox COLMAP
(Schönberger and Frahm 2016, Schönberger et al. 2016) is used to generate a reference. Moreover,
in a qualitative comparison, depth maps that are predicted by the presented approach on
the FB dataset (cf. Section 4.4.1) are compared against depth maps estimated by FaSS-MVS.


While reference data computed by COLMAP is expected and shown to be more accurate than 
that produced by FaSS-MVS and SSDE, it is still deduced from the same modality, namely the 
camera images, and thus the depth maps, which are used as reference, will not be perfectly 
accurate. To overcome this issue, the synthetic dataset is generated by exploiting the capa- 
bilities of modern computer graphics and rendering engines in order to extract an accurate 
ground truth. In particular, the open-source software GTAVisionExport (Johnson-Roberson 
et al. 2017) is adopted to extract color images together with corresponding high-resolution 
ground truth depth maps from the video game Grand Theft Auto V (GTA V). While the level of
detail greatly differs between real-world and synthetic data, the rendering of synthetic data
is typically more flexible, making it possible to manipulate the environment and define a wide range
of different camera trajectories. Thus, in case of the synthetic dataset, henceforth referred to
as the GTA dataset, a variety of scenes with different flight altitudes and flight trajectories,
such as linear fly-overs or circular flights orbiting a building, as well as different weather
conditions and times of day are simulated. The ground truth depth maps are truncated at 100 m.


For all datasets, around 5000 images were used, splitting them with a 90/10 % ratio into 
training and evaluation set. In case of the two aerial datasets, training was done using a 
fused set of all training images from both datasets, yielding the same results as when training 
on the two datasets separately. 


5.6.2 Error Measures 


Since the depth maps predicted by SSDE are free of a metric scale, only relative error measures 
are used to assess the quality of the predicted depth maps. In particular, the relative L1- 
norm (L1-rel) (cf. Equation (4.14)) between the prediction and the reference as well as the 
completeness (Cpl_θ) (cf. Equation (4.16)) are used, both introduced in the evaluation of FaSS- 
MVS in Section 4.4.2. If corresponding estimates exist for all ground truth values, the 
completeness corresponds to the accuracy δ_θ used by Godard et al. (2019), Zhou et al. (2017) 
and Hermann et al. (2020). This holds for the depth maps predicted by the SSDE approach 
due to their density of 100 %. To account for differences in resolution as well as depth scale 
between the predicted depth maps and the reference data, the predicted depth maps are resized 
to the resolution of the reference and their depth values are aligned by median scaling. 
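

The evaluation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the evaluation code used in this work: Equations (4.14) and (4.16) are not repeated here, so the L1-rel is assumed to be the mean absolute difference normalized by the reference depth, and the completeness is assumed to follow the ratio test of the accuracy measure of Godard et al. (2019). The function names, the convention that a reference value of 0 marks a missing measurement, and the omission of the resizing step are likewise assumptions.

```python
import numpy as np


def median_scale(pred, ref, valid):
    """Scale the prediction so that its median matches that of the reference."""
    return pred * (np.median(ref[valid]) / np.median(pred[valid]))


def l1_rel(pred, ref, valid):
    """Mean absolute depth difference, normalized by the reference depth."""
    return float(np.mean(np.abs(pred[valid] - ref[valid]) / ref[valid]))


def completeness(pred, ref, valid, theta=1.25):
    """Percentage of reference pixels whose estimate deviates by a factor below theta."""
    ratio = np.maximum(pred[valid] / ref[valid], ref[valid] / pred[valid])
    return float(100.0 * np.mean(ratio < theta))


def evaluate(pred, ref, theta=1.25):
    """Evaluate a scale-free prediction against a reference depth map of equal size."""
    valid = ref > 0                    # assumption: 0 marks missing reference values
    pred = np.clip(pred, 1e-6, None)   # guard against division by zero
    pred = median_scale(pred, ref, valid)
    return l1_rel(pred, ref, valid), completeness(pred, ref, valid, theta)
```

Because the median scaling is computed per depth map, a prediction that differs from the reference only by a global factor yields an L1-rel of 0 and a completeness of 100 %.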


5.6.3 Experimental Results 


In the following, the experimental results achieved on the dataset of the KITTI benchmark 
(Section 5.6.3.1) and on aerial imagery (Section 5.6.3.2) are presented. 


5.6.3.1 Street-Level Imagery 


In order to allow a comparison with other corresponding approaches from literature 
(Godard et al. 2019, Zhou et al. 2017) as well as with the approach for Real-Time SGM Stereo 
Optimized for Embedded ARM and CUDA Devices (ReS²tAC), which is presented in Chapter 3 
of this work, an evaluation on the data from the KITTI benchmark (Menze and Geiger 2015) 
is performed, even though the nature and characteristics of this data are not relevant for the 
use-case considered in the scope of this work. To train the approach on the KITTI data, the 
raw data of individual sequences is used, which allows temporal image bundles to be created from the 
same camera instead of using the actual stereo data. The model was trained with an image size 
of 416 x 128 pixels and a batch size of 4. In the evaluation, the ground truth depth maps 
are truncated at 80 m. 


Table 5.1: Quantitative results achieved by the presented approach for self-supervised single-view depth estimation 
(SSDE) on the dataset of the KITTI benchmark. As evaluation metrics, the L1-rel as well as the completeness 
at a threshold of 25 % with respect to the ground truth depth are used. In addition, the results of two 
other approaches for SSDE as well as those achieved by ReS²tAC are listed for comparison. 


Approach                 L1-rel    Cpl_{1.25} (in %) 
Zhou et al. (2017)       0.183     73.4 
Godard et al. (2019)     0.133     84.1 
ReS²tAC                  0.043     88.6 
SSDE                     0.272     83.9 


Table 5.1 lists the quantitative results achieved by the presented approach for SSDE on the 
dataset of the KITTI benchmark together with the results of two other approaches for SSDE 
as well as those achieved by ReS²tAC. The results reveal that the presented approach achieves 
results similar to those of related approaches from literature. However, compared to a conventional 
depth estimation approach based on two-view stereo, such as ReS²tAC, the SSDE approaches 
are inferior. Figure 5.4 shows an excerpt of the depth maps predicted by the presented approach 
on data of the KITTI benchmark, together with the corresponding input image and 
ground truth. Since the LiDAR sensor, from which the ground truth is generated, only covers 
the lower part of the camera image, there is a large area in the ground truth depth map for 
which no reference is provided, depicted in black in Figure 5.4. 


Figure 5.4: Qualitative results of the approach for self-supervised single-view depth estimation on data from the 
KITTI benchmark. Row 1: Reference images. Row 2: Estimated depth maps. Row 3: Ground truth 
depth maps generated from the LiDAR point clouds. The depth is color-coded, going from blue (near) 
via yellow to red (far). Since the scaling between the predicted depth maps and the reference differs, the 
color gradients between the two modalities do not coincide. 


5.6.3.2 Aerial Imagery 


Table 5.2 lists the quantitative results of the presented approach for SSDE achieved on the two 
aerial image datasets, namely TMB and GTA. While the L1-rel error achieved by SSDE on the 
GTA dataset is lower than that achieved on the TMB dataset, the completeness achieved on 
the TMB dataset is higher than that achieved on the GTA dataset. This irregularity is assumed 
to arise from the higher scene depth of the GTA dataset, since a larger ground truth depth 
will have a greater normalization effect on the L1-rel error. Thus, for comparison between 
different datasets, the completeness is more meaningful and, in turn, reveals that the SSDE 
approach achieves the better results on the TMB dataset. The reason for this could lie in the 
lower scene complexity and variety of the TMB dataset compared to that of the GTA dataset. 
The strength of SSDE approaches on datasets with lower scene complexity is also discussed 
by Madhuanand et al. (2021). 


In addition to the sole evaluation of the results from the SSDE approach with respect to the 
ground truth, a comparison to the results achieved by FaSS-MVS with SGM^{Π-pg} (cf. Chapter 4) 
is done in Table 5.2. As expected, the results of the SSDE approach are inferior to 
those achieved by approaches that use two or more views. In particular, the L1-rel error of 
FaSS-MVS is significantly lower than that of the SSDE approach. In terms of completeness, 
however, the SSDE approach clearly outperforms FaSS-MVS on the TMB dataset. This is due 
to the lower density of the depth maps produced by FaSS-MVS (cf. Table 5.2, Column 6), as 
the post-filtering removes a considerable number of inconsistent pixels. However, the smaller difference 
between Cpl_{1.25} and Cpl_{1.05} of FaSS-MVS compared to the SSDE approach shows that 
the depth maps of FaSS-MVS are generally more accurate. Nonetheless, the use of only one 
view for depth estimation has a number of advantages as further discussed in Section 5.7.1. 


For a qualitative comparison, an excerpt of the predicted depth maps from the two aerial 
datasets is shown in Figure 5.5 and Figure 5.6 together with the input images and reference 
depth maps. As before, the depth is color-coded, going from blue (near) via yellow to red 
(far). Since the scaling between the predicted depth maps and the reference differs, the color 
gradients between the two modalities do not coincide. 


Table 5.2: Quantitative results achieved by the presented approach for SSDE on aerial imagery. As evaluation metrics, 
the L1-rel as well as the completeness at thresholds of 25 % and 5 % with respect to the ground truth 
depth are used. For comparison, the results of FaSS-MVS with SGM^{Π-pg} optimization are listed. 


Dataset   Approach               L1-rel   Cpl_{1.25} (in %)   Cpl_{1.05} (in %)   Density (in %) 
TMB       SSDE                   0.171    93.5                69.8                100 
TMB       FaSS-MVS-SGM^{Π-pg}    0.037    72.0                66.4                70.4 
GTA       SSDE                   0.081    91.0                57.8                100 
GTA       FaSS-MVS-SGM^{Π-pg}    0.045    91.4                76.1                93.8 


Figure 5.5: Qualitative results of the approach for SSDE on aerial imagery from the GTA V (Rows 1-3) and TMB (Rows 4-6) dataset. Rows 1 & 4: Reference images. 
Rows 2 & 5: Estimated depth maps. Rows 3 & 6: Reference depth maps rendered from the game engine in case of the GTA V dataset and generated by 
COLMAP in case of the TMB dataset. The depth is color-coded, going from blue (near) via yellow to red (far). Since the scaling between the predicted 
depth maps and the reference differs, the color gradients between the two modalities do not coincide. (Hermann et al. 2020, Fig. 4.) 


Figure 5.6: Qualitative results of the approach for SSDE on the use-case-specific FB dataset together with depth 
maps of FaSS-MVS for comparison. Column 1: Reference image. Column 2: Depth maps predicted by 
the SSDE approach. Column 3: Depth maps predicted by FaSS-MVS. The depth is color-coded, going 
from blue (near) via yellow to red (far). Since the scaling between the predicted depth maps and the 
reference differs, the color gradients between the two modalities do not coincide. 


5.6.4 Ablation Study 


In the following, the results of an ablative study with respect to different aspects of the 
presented approach for self-supervised single-view depth estimation are presented and 
discussed. Here, the study is focused on the use of different encoder topologies (Sec- 
tion 5.6.4.1) within the U-Net for depth estimation as well as the use of different batch 
sizes (Section 5.6.4.2), resolution of the input images (Section 5.6.4.3) and input bundle size 
(Section 5.6.4.4) during training. 


5.6.4.1 Encoder Topologies 


In the scope of this work, the impact of three different CNN topologies for the use as encoder 
within the U-Net is investigated, namely the VGG-Net16 (Simonyan and Zisserman 2015), 
the ResNet18 (He et al. 2016) and the DenseNet (Huang et al. 2017). Among others, the most 
important requirements for a CNN topology to be used as an encoder within the presented 
approach are the possibility to extract four layers of the CNN to facilitate the skip connections 
between the encoder and the decoder, as well as the capability of an end-to-end training. Due 
to the different configurations and training speeds of the different CNNs, all topologies are 
trained until the error curve during training flattens, in order to allow a fair comparison. This 
results in a long training time, which is why the input resolution is reduced to 192 x 96 pixels 
in the scope of this study. 


Table 5.3: Comparison of different CNN architectures for the use as encoder within the U-Net. Results are computed 
on the TMB dataset with a reduced image size of 192 x 96 pixels. 


Architecture                                 L1-rel    Cpl_{1.25} (in %) 
VGG-Net16 (Simonyan and Zisserman 2015)      0.133     92.7 
ResNet18 (He et al. 2016)                    0.120     92.6 
DenseNet (Huang et al. 2017)                 0.184     91.2 


The results in Table 5.3 show that all three considered architectures are able to predict mean- 
ingful depth maps on the TMB dataset, with the ResNet18 achieving the best results. At the 
same time, the ResNet18 has the lowest number of parameters, resulting in a faster training, 
which is why it is chosen for the presented SSDE approach. Deeper variants of the ResNet, i.e. 
ResNet32 and ResNet50, were also investigated. However, they have a much greater number 
of parameters that are to be trained, requiring a reduction in the batch size, which, in turn, 
led to a more unstable training. 


5.6.4.2 Batch Size 


On the street-level imagery from the KITTI benchmark, a batch size of 4 is chosen, as 
done by Godard et al. (2019). For the final training configuration on the aerial image datasets, 
however, the batch size is empirically set to 20 (cf. Section 5.5), due to an unstable training 
when smaller batch sizes are used. The instability leads to a divergence of the 
training in early iterations, as the optimization gets stuck in a local minimum. This divergence 
manifests itself in the network predicting a depth of 0 for a significant number of pixels 
in the depth maps, from which it does not recover. This is presumably due to the inability 
of the pose estimation network to reliably predict the relative transformations if the batch 
size is too small. This assumption is backed by the fact that sequences with complex camera 
movements, such as a high amount of rotation without significant translational movement, 
are particularly affected. A possible explanation might lie in the sharing of the encoder between 
the depth and pose estimation network, since both are dependent on each other, which 
could possibly create a deadlock in which the prediction of the depth maps degenerates and 
the training gets stuck. With an increase in batch size, the possibility of processing a higher 
variety of camera movement within one batch also rises, preventing a divergence in early iterations. 
However, once reasonable depth maps are predicted by the network, the batch size 
can be reduced. Further experiments have shown that a batch size larger than 20 does not 
achieve better results. Instead, it restricts the resolution that can be used for the input images, 
since it increases the memory consumption, which is bounded by 
the available memory on the GPU. 


5.6.4.3 Image Size 


The quality of the resulting depth maps is greatly influenced by the image size used as input. 
The image size, on the other hand, is limited by the memory available on the GPU, depending 
on the batch size used. While the results listed in Table 5.2 are achieved using an input 
resolution of 384 x 224 pixels, an ablative study on the TMB dataset with respect to different 
sizes of the input images has shown that even a small resolution of 192 x 96 pixels achieves 
reasonable results. In this, it is also found that the time required to train a network is directly 
proportional to the image size used. Training the network with an image size of 192 x 96 pixels 
instead of 384 x 224 pixels is 4x faster, while at the same time reducing the quality of the 
quantitative results by only 2 %. The use of even larger image sizes, i.e. 768 x 448 pixels, 
however, is restricted by the limited memory of the GPU. To overcome this restriction, the 
network was pre-trained on a smaller image size, followed by a training using a higher 
image resolution but a smaller batch size. This, however, did not lead to superior results, 
while at the same time greatly increasing the training time. 
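

The stated proportionality between training time and image size can be checked with a short calculation. The sketch below assumes that "image size" refers to the number of pixels; the function names are hypothetical and only serve the illustration.

```python
def pixel_count(size):
    """Number of pixels of an image size given as (width, height)."""
    width, height = size
    return width * height


def relative_cost(size, baseline):
    """Training cost of `size` relative to `baseline`, assuming cost ~ pixel count."""
    return pixel_count(size) / pixel_count(baseline)


# Resolutions taken from the text above.
speed_up_small = 1.0 / relative_cost((192, 96), (384, 224))   # about 4.67
slow_down_large = relative_cost((768, 448), (384, 224))       # exactly 4.0
```

Under this assumption, the reduction from 384 x 224 to 192 x 96 pixels corresponds to a pixel ratio of about 4.7, which is roughly in line with the reported 4x faster training.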


5.6.4.4 Size of the Input Bundle 


In a final ablative study, it was investigated whether the approach benefits from using more 
than three input images during training. In this, five input images, with two images to either 
side of the reference image, are used. The study, however, shows that this does not lead to any 
superior results, both quantitatively and qualitatively. It is assumed that, with more matching 
images being projected into the reference view, the probability of false positives within the 
process of image sampling and image matching is increased, due to a higher probability of 
pixels that presumably have a high similarity but are sampled from wrong locations in the 
image. This, in turn, hinders the training process. 
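

The two bundle configurations compared above can be illustrated as follows. This is only a sketch of how such input bundles might be assembled from an ordered image sequence; the function name and the `stride` parameter, which controls the temporal distance between neighboring images, are assumptions and not taken from the implementation of this work.

```python
def build_bundles(frames, neighbors=1, stride=1):
    """Group an ordered sequence into (reference, matching images) bundles.

    neighbors: number of matching images to either side of the reference
               (1 yields bundles of three images, 2 yields bundles of five).
    stride:    hypothetical frame distance between adjacent bundle images.
    """
    offset = neighbors * stride
    bundles = []
    # Only frames with a complete neighborhood on both sides serve as reference.
    for i in range(offset, len(frames) - offset):
        matching = [frames[i + k * stride]
                    for k in range(-neighbors, neighbors + 1) if k != 0]
        bundles.append((frames[i], matching))
    return bundles


frames = list(range(6))                          # stand-in for six consecutive images
three_view = build_bundles(frames, neighbors=1)  # first bundle: (1, [0, 2])
five_view = build_bundles(frames, neighbors=2)   # first bundle: (2, [0, 1, 3, 4])
```

With five images per bundle, each reference image is compared against twice as many warped matching images, which is where the additional false matches discussed above would originate.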


5.7 Discussion 


Before providing some concluding remarks, three different aspects of the presented approach 
for the task of SSDE are discussed in the following. First, the overall results are put into 
context and discussed in Section 5.7.1 with respect to a practical use of the approach. This 
is followed by a discussion on the run-time of the approach in Section 5.7.2. Finally, the 
generalization and self-improving capabilities are discussed in Section 5.7.3, which is very 
important in order to conclude whether the approach is actually ready to be used in practice. 


5.7.1 Overall Results 


Overall, the results presented in Table 5.2 as well as in Figure 5.5 show that the presented 
approach for SSDE is well capable of learning how to predict a depth map from a single aerial 
image. Both quantitative and qualitative results show that the quality of the resulting depth 
maps is high with respect to the reference data. Moreover, the comparisons in Table 5.1 and 
Table 5.2 show that it is also capable of keeping up with conventional approaches that estimate 
depth maps from two or more images as well as related work from literature. Presumably 
due to the low image resolution with which the approach is trained and which is also used 
for inference, the model is not able to reconstruct fine structures like vegetation and sharp 
edges. However, the essential geometry of the scene and objects is predicted correctly. 


Furthermore, as already stated, approaches for self-supervised single-view depth estimation 
have a number of advantages over methods that rely on conventional dense image matching. 
For example, since SSDE approaches only rely on one input image, they are not sensitive 
to camera movement at the time of inference and, thus, do not suffer from instabilities if 
the camera movement degenerates. During training, in which the approach still greatly 
depends on a good choice of input images, the setup of the input data can be controlled 
more easily, at least if the model is not to be retrained on-the-fly. Another advantage is that 
approaches for SSDE produce fully dense depth maps, which typically also contain estimates 
in areas in which conventional DIM approaches often fail due to a weak texture or occlusions. 
Lastly, since SSDE approaches use only one input image to predict a depth map, they can 
usually operate at a higher frequency compared to monocular MVS approaches, which first 
have to wait until an appropriate set of input images is acquired. 


Nonetheless, there is a significant drawback of the presented approach that is to be discussed 
when looking at the overall results. The training of the approach is done on a great number 
of disjoint image bundles consisting of one reference image and one or more matching images 
to either side of the reference image. When these bundles are constructed, 
there is neither a consistent baseline between the individual images nor a temporal 
coherency that would allow the CNN to learn a depth scale which is consistent 
over all predicted depth maps. In fact, the resulting CNN predicts a different depth scale for 
every depth map, making their use in subsequent steps, like obstacle detection and collision 
avoidance or 3D mapping, difficult. 


5.7.2 Run-Time 


A key characteristic of the presented approach for self-supervised single-view depth estima- 
tion, and with it its greatest strength, is its capability to predict a depth map from only one 
aerial image. This allows the SSDE approach to operate at a much higher frequency, since 
there is no need to wait until a bundle of two or more images with an appropriate baseline is 
acquired. This, in turn, also results in a significantly lower latency depending on the time it 
takes to predict a single depth map. As noted in Section 5.5, the approach reaches 20 FPS on an 
NVIDIA Titan X during inference. In this, the inference is done using TensorFlow and without 
any optimization regarding run-time. In order to reason about the actual processing rate 
during a productive use of the presented approach, a more optimized processing pipeline was 
implemented using OpenCV, which only covers the prediction of a depth map from a single 
aerial image, given a trained model. Table 5.4 states the average run-time required by the 
more optimized processing pipeline to predict a single depth map on different hardware architectures, 
i.e. a desktop GPU in the form of an NVIDIA RTX 2070 Super and the embedded Tegra 
GPU deployed on the NVIDIA Jetson AGX. The measurements clearly show that the approach 
is capable of real-time and low-latency processing, reaching theoretical frame rates of up 
to 250 FPS and 62 FPS on the desktop and embedded hardware, respectively. Note that, even 
though the model is trained with an image size of 384 x 224 pixels, it is possible to process 
images with a larger image size as the convolution kernel of the CNN is iteratively moved 
over the input image. However, the greater the difference in image size between the training 
data and inference data, the less accurate the results are. The suitability of the SSDE approach 
for real-time and low-latency processing suggests that it is a good alternative to conventional 
two-view stereo approaches for the task of obstacle detection and collision avoidance. How- 
ever, as further discussed in Section 6.1, the possibly insecure handling of unknown data by 
learning-based approaches impedes their use for safety-critical applications. 


Table 5.4: Average run-time of the SSDE approach to predict a depth map from a single aerial image of different 
image sizes executed on different hardware architectures. 


Hardware                  Embedded SoC   Run-time (in ms) 
                                         at 384 x 224 pixels   at 768 x 448 pixels 
NVIDIA RTX 2070 Super     no             ~ 4                   ~ 12 
NVIDIA Jetson AGX         yes            ~ 16                  ~ 45 
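

The theoretical frame rates follow directly from the run-times in Table 5.4. The short sketch below reproduces this conversion; the dictionary layout and the function name are chosen for illustration only, and the values are the approximate run-times listed in the table.

```python
# Approximate average run-times in milliseconds, taken from Table 5.4.
RUNTIME_MS = {
    ("NVIDIA RTX 2070 Super", (384, 224)): 4.0,
    ("NVIDIA RTX 2070 Super", (768, 448)): 12.0,
    ("NVIDIA Jetson AGX", (384, 224)): 16.0,
    ("NVIDIA Jetson AGX", (768, 448)): 45.0,
}


def theoretical_fps(runtime_ms):
    """Upper bound on the frame rate if depth prediction is the only processing step."""
    return 1000.0 / runtime_ms


for (device, (width, height)), runtime in RUNTIME_MS.items():
    print(f"{device} at {width} x {height}: {theoretical_fps(runtime):.1f} FPS")
```

These rates are upper bounds, since image acquisition, pre-processing and data transfer are not part of the measured run-time.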


5.7.3 Generalization and Self-Improving Capabilities 


An important criterion of deep-learning-based approaches with respect to the use and de- 
ployment in an actual application is their generalization capabilities, meaning how well they 
can cope with unknown data. In the evaluation of the generalization capabilities of the pre- 
sented SSDE approach, a test setup with two disjoint video sequences (Sequence A & B) is 
used. The two sequences do not have any overlap or common objects but were captured un- 
der similar conditions with respect to camera angle, weather and daytime. In the conducted 
experiment, the CNN is first trained until convergence on Sequence A and then evaluated 
on Sequence B. As depicted in Figure 5.7a at epoch 0, when switching to a new, yet similar 
sequence, the completeness Cpl_{1.25} drops by around 25 %. This suggests that the trained 
model generalizes rather poorly to unknown data and, in turn, makes the approach rather 
impractical for a use in actual applications. In contrast, however, the presented approach for 
self-supervised training does not require any special reference data during training, which 
makes an on-the-fly fine-tuning with respect to a new sequence possible, resulting in so-called 
self-improving capabilities. 


To evaluate how fast a pre-trained model can be fine-tuned to a new sequence, an experiment 
was conducted in which the pose estimation network was not removed and the weights of 
the model were not fixed after initial training on Sequence A, so that training could be continued 
on Sequence B. As illustrated by the green curve in Figure 5.7a, it takes around 2-4 epochs 
to fine-tune a pre-trained model to a new sequence, reaching the initial completeness again. 
With the configuration described in Section 5.5, one training epoch takes around 2 minutes, 
making the fine-tuning around 10x faster than the initial training from scratch. This, however, 
greatly depends on the similarity between the different sequences. The 
fine-tuning on a sequence that greatly differs from the initial sequence in terms of content 
and environment is likely to take longer. 


The curves in Figure 5.7a also suggest that a network that is fine-tuned to a new sequence 
tends to forget the initial sequence, as with each epoch the completeness achieved on Sequence A 
drops. Thus, in a further experiment, a fine-tuning with images from both sequences 
was conducted, showing that the completeness on Sequence A could be maintained while at 
the same time causing a similar increase in completeness on Sequence B (cf. Figure 5.7b). 
Since, in this case, the training dataset is bigger, the time it takes to train for one epoch is, 
however, also increased. Thus, depending on the application, it has to be considered whether 
a fast fine-tuning to the current surrounding is more important, considering that the data 
used for initial training is forgotten. 


Figure 5.7: Experimental results on the generalization and self-improving capabilities of the presented approach for 
the task of self-supervised single-view depth estimation. In this, the CNN is first trained until convergence 
on Sequence A and then evaluated and fine-tuned on Sequence B, which is similar to Sequence 
A but was not used during the initial training of the CNN. (a) Completeness achieved on Sequence A 
and Sequence B after x epochs of fine-tuning only on Sequence B (Hermann et al. 2020, Fig. 5). (b) 
Completeness achieved on Sequence A and Sequence B after x epochs of fine-tuning on a mixture of 
Sequence A and Sequence B. 


To sum up, due to the fact that a self-supervised training does not require special training 
data, the presented approach for the task of self-supervised single-view depth estimation can 
relatively quickly be fine-tuned on new image data, if it is pre-trained on an appropriate 
dataset. Here, it is important that the dataset that is used for the initial training is 
similar to the environment in which the approach is later to be used. The method of self-improvement, 
however, also has some disadvantages. When the CNN is to adapt itself to 
the current environment, the pose estimation network cannot be omitted and is needed in 
addition to the network responsible for depth estimation. This results in a larger model. 
Moreover, more than one input image is needed to estimate the depth. And since not only 
the forward pass through the network but also the back-propagation of the training loss, and 
with it the adjustment of the weights, is required in the processing, the run-time is increased. 


5.8 Conclusion 


In conclusion, this chapter presents an approach that allows a convolutional 
neural network to be flexibly trained for the task of SDE from aerial imagery in a self-supervised manner. The use 
of self-supervised training eliminates the need for appropriate reference depth maps which, 
in the case of aerial imagery, are especially cumbersome and costly to acquire. In contrast, 
the approach can be trained on any image data from a video sequence, depicting a static 
scene, captured by a single moving camera. By formulating the training task as a novel view 
synthesis and image reconstruction process, the network learns how to predict a depth map 
from a single input image as well as the relative transformations between the input images, 
as an image can only be correctly synthesized if the scene geometry and the camera setup are 
learned correctly. While the approach requires three or more input images during training, 
it only requires a single input image for the depth estimation during inference, allowing it to 
also be used for the task of reconstructing scenes with dynamic objects. For the task of depth 
estimation from a single image, only a portion of the CNN is required, which reduces the 
size of the model and with it the processing time. 
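

The geometric core of such a view synthesis can be sketched as follows: given the predicted depth map of the reference image and the predicted relative transformation, each reference pixel is projected into a matching image, from which the color is then sampled. The sketch assumes a pinhole camera with a known intrinsic matrix K and a 4 x 4 transformation from the reference to the matching camera. It is a generic illustration and not the implementation used in this work; the function and parameter names are hypothetical.

```python
import numpy as np


def reproject(depth, K, T_ref_to_match):
    """Project every reference pixel into the matching view.

    depth:          (H, W) depth map of the reference image
    K:              (3, 3) pinhole intrinsics, assumed identical for both views
    T_ref_to_match: (4, 4) rigid transformation, reference -> matching camera
    Returns an array of shape (2, H, W) holding the (u, v) sampling coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

    # Back-project the pixels to 3D points in the reference camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)

    # Transform the points into the matching camera frame.
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    points_match = (T_ref_to_match @ points_h)[:3]

    # Project into the matching image and dehomogenize.
    projected = K @ points_match
    return (projected[:2] / projected[2:3]).reshape(2, h, w)
```

If either the depth or the relative transformation is wrong, the sampling coordinates point to wrong locations in the matching image and the synthesized image deviates from the reference image, which is the signal that drives the self-supervised training.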


The experimental results suggest that the approach can possibly be used to facilitate a number 
of applications, such as real-time 3D mapping, autonomous navigation or scene understanding. 
Although the quality of depth maps obtained from a single image is unsurprisingly 
inferior to that obtained by means of DIM, and although the depth maps produced by the 
presented approach lack a consistent and metric scale, depth estimation from a single image 
can operate with a higher frequency than approaches based on MVS. Moreover, SDE does not 
suffer from typical drawbacks, such as occlusions, ambiguous matching in weakly-textured 
areas and inability to reconstruct dynamic objects. In comparison to multi-camera DIM ap- 
proaches, which overcome the inability to reconstruct dynamic objects by a synchronized 
image capture, the depth range supported by SDE approaches is not limited by the confined 
baseline between the cameras, as demonstrated in Figure 5.8. Thus, self-supervised single- 
view depth estimation approaches are well-suited to complement, and in some applications 
even substitute, depth estimation methods relying on DIM and MVS. 


Figure 5.8: Exemplary results of the SSDE approach to estimate depth maps from images with a very large focal 
length. Column 1: Reference image. Column 2: Predicted depth maps. The depth is color-coded, going 
from blue (near) via yellow to red (far). 


A practical use of deep-learning-based methods, however, is typically hindered by uncertainties 
with respect to their operation on unknown image data and generalization. While 
experimental results show that a pre-trained model generalizes rather poorly to a new and 
yet still similar sequence, resulting in a drop in completeness by about 25 %, it is also demonstrated 
that, because the approach does not require any special training data, it 
can quickly be adapted and fine-tuned to the new sequence, reaching the initial completeness 
after only a few training epochs. So far, the self-improving capabilities have only been 
demonstrated using offline learning. In future work, an investigation with respect to on-the-fly 
learning, meaning that the model is retrained during the actual application, is still to be 
done. Furthermore, investigations on the generalization capability to a change in the scene 
for which the CNN has been trained, e.g. other weather or lighting conditions or even modifi- 
cations to scene geometry, are to be conducted. This is particularly important for applications 
which repeatedly visit the same scene, such as 3D change detection or monitoring. 


Lastly, it is very important that the depth maps obtained from single-view depth estimation 
are enhanced with a metric scale in order to facilitate applications that rely on accurate 
distance measurements, such as autonomous driving and navigation. Solutions to this are 
manifold. For one, the training data could be enhanced with a few metric depth maps in 
order to regularize the training. Another possibility is to remove the pose estimation network 
and use a calibrated pose estimation approach, such as visual inertial SLAM (Campos et al. 
2021), which would also remedy the drawback of unstable training in case of complex camera 
movement. The integration of an external method, however, would reduce the benefit of an 
end-to-end training. 


The presented approaches for fast and efficient dense depth estimation based on two-view 
stereo (Chapter 3) and multi-view stereo (Chapter 4) as well as estimating depth from a sin- 
gle view by means of deep learning (Chapter 5) are embedded into a high-level framework 
(Section 1.2), which aims to support emergency forces, in particular first responders, in case 
of disaster relief by aerial reconnaissance using commercial off-the-shelf (COTS) UAVs. In 
this, the focus lies on the facilitation of three high-level applications, namely real-time ob- 
stacle detection and collision avoidance (Section 6.1) for the safe and autonomous use of the 
UAVs, as well as fast and incremental 3D mapping (Section 6.2) and 3D change detection (Sec- 
tion 6.3) for the assessment of the disaster site. Details on how the presented approaches for 
dense depth estimation aid the execution of these high-level applications are presented and 
discussed in the following sections. 


6.1 Reactive Obstacle Detection and Collision 
Avoidance for UAVs Based on Dense Disparity Maps 


As illustrated by Figure 1.1, modern COTS UAVs are equipped with a number of sensors to 
monitor the close vicinity and flight path of the UAV, which allows obstacles to be detected and 
collisions to be avoided. This, in turn, is crucial for a safe and autonomous flight and is particularly 
important if the UAV is to be flown beyond visual line-of-sight (BVLOS). In addition, it can 
further increase the benefit of the UAV, as the operating personnel can concentrate on the 
data produced by the system rather than flying the UAV itself. For the task of monitoring the 
close surrounding of the UAV, typically ultrasonic as well as visual sensors are used, due to 
their low weight, cost and power consumption. While the strength of ultrasonic sensors lies 
in the reliable detection of obstacles in a very close vicinity, i.e. a few tens of centimeters, 
visual stereo cameras allow the flight path to be monitored up to a few tens of meters in front of 
the UAV. To detect obstacles and calculate an appropriate evasion maneuver based on data 
from a stereo vision sensor, as illustrated by Figure 6.1, first a dense disparity or depth map 
is to be calculated from which the obstacles can be detected and a flight path around them 


Figure 6.1: Overview of the discussed approach for reactive obstacle detection and collision avoidance based on 
dense disparity maps. Given an image pair from a stereo camera, I_L and I_R, a disparity map U is
calculated in the first step, for example by using ReS²tAC (Chapter 3). In a subsequent step, the disparity
map is used to detect obstacles in the flight path and to calculate an appropriate evasion maneuver. (Ruf 
et al. 2018a, Fig. 1.) 


can be calculated. In this, it is of great importance that the whole processing pipeline runs
in real-time and with low latency to ensure a quick and efficient reaction, thus requiring
execution on an on-board processing unit. With ReS²tAC (Chapter 3), an approach for real-time
disparity estimation by means of two-view stereo on an embedded ARM and CUDA
device, such as the DJI Manifold 2-G, is presented and can, in turn, be used as the first stage
within the processing pipeline illustrated in Figure 6.1. To demonstrate the suitability of the
disparity maps estimated by ReS²tAC for the task of detecting obstacles and calculating a
collision avoidance maneuver, a simple yet effective approach is implemented for the
second stage of the pipeline, which has also been published in the paper of Ruf et al. (2018a).


In contrast to using the disparity maps to incrementally construct a representative map of 
the environment in which the UAV is operated, allowing for an autonomous exploration of 
unknown areas, the considered approach only operates on a single disparity map to calculate 
the free space in front of the stereo camera and allow for reactive collision avoidance. It relies 
on the transformation of the disparity map U into simpler representations, the so-called
U- and V-Maps. They can be interpreted as a top-down and a side view of the area covered by
the disparity map and are often used for the task of reactive obstacle detection and collision
avoidance, e.g. in the work of Labayrade et al. (2002), Li and Ruichek (2014), and Oleynikova et
al. (2015). In the construction of the U-Map, a histogram of disparity occurrences is calculated 
for every column of U and stored inside a map of size W × |Δu|. Here, W represents the
width of the disparity map and |Δu| = Δu_max − Δu_min denotes the number of disparities
inside the disparity range with which U is calculated. The V-Map is calculated analogously
by computing a histogram for every row of U and storing these histograms inside a map of size
|Δu| × H. Here, |Δu| again denotes the number of disparities and H represents the height of
the disparity map. For every possible obstacle, the corresponding disparities will accumulate 
in the histograms. Thus, the U-Map encodes the depth and width of each object, while the 
V-Map shows the height of each object as well as the ground plane. To further increase 
robustness with respect to small outliers and suppress uncertainties, the U- and V-Maps are 
binarized by applying a threshold filter. Further processing by means of dilation as well 
as cutting off small disparity values to remove the cluttered background reveals prominent
objects in the foreground. Here, kernels of size 7 × 3 and 3 × 7 are used for the dilation
of the U- and V-Map, respectively. Next, the approach of Suzuki and Abe (1985), which is 
implemented in OpenCV, is used to find the contours of each object from which cylindrical 
models are abstracted by drawing ellipses around the contours in the U-Map and rectangles 
around the contours in the V-Map. From these cylindrical models, the depth of the obstacles 
as well as the closest waypoint to the left, right, top or bottom can be calculated. 
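
The construction of the binarized U- and V-Maps described above can be sketched as follows. This is a minimal, unoptimized illustration and not the implementation of Ruf et al. (2018a): the function name `compute_uv_maps`, the parameters `count_threshold` and `min_disparity`, and the toy input are assumptions made for this example, and the subsequent dilation and contour extraction are omitted.

```python
def compute_uv_maps(disparity, num_disparities, count_threshold=2, min_disparity=0):
    """Return binarized (u_map, v_map) for an integer disparity map.

    disparity: list of H rows with W integer disparities each.
    Values outside [min_disparity, num_disparities) are ignored, which also
    cuts off small disparities that belong to the distant background.
    """
    height, width = len(disparity), len(disparity[0])
    # U-Map: one histogram per image column (num_disparities rows, W columns).
    u_hist = [[0] * width for _ in range(num_disparities)]
    # V-Map: one histogram per image row (H rows, num_disparities columns).
    v_hist = [[0] * num_disparities for _ in range(height)]
    for y, row in enumerate(disparity):
        for x, d in enumerate(row):
            if min_disparity <= d < num_disparities:
                u_hist[d][x] += 1
                v_hist[y][d] += 1
    # Threshold filter: keep only histogram bins that received enough votes.
    u_map = [[int(c >= count_threshold) for c in r] for r in u_hist]
    v_map = [[int(c >= count_threshold) for c in r] for r in v_hist]
    return u_map, v_map


# Toy example: background at disparity 1, obstacle at disparity 5 in columns 2-3.
toy = [
    [1, 1, 5, 5, 1, 1],
    [1, 1, 5, 5, 1, 1],
    [1, 1, 5, 5, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
u_map, v_map = compute_uv_maps(toy, num_disparities=8)
```

In the toy example, the obstacle shows up as a horizontal run in row 5 of the U-Map (its width and depth) and as a vertical run in column 5 of the V-Map (its height). Since both histograms are independent per column and per row, the loops map naturally onto a parallel implementation.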


Two exemplary results of the U-/V-Map calculation are illustrated in Figure 6.2. While the 
data in the upper example was captured by a hand-held stereo camera, the data of the bottom
example is taken from the stereo data captured by the DJI Matrice 210v2 RTK. The qualitative 
Figure 6.2: Qualitative results of the U- and V-Maps computed for reactive obstacle detection and collision avoidance
based on the disparity maps of ReS²tAC. The upper example depicts a scene captured by a hand-held
stereo camera, while the bottom example is taken from the stereo data captured by the DJI Matrice
210v2 RTK. The images on the left represent the corresponding reference image. The images on the
right are the disparity maps computed by ReS²tAC, together with the U-Map on top and the V-Map
to the right. The disparity in the disparity maps is color-coded, going from yellow (high disparity) via 
green to blue (low disparity). The detected obstacles are marked by red ellipses in the U-Maps and with 
blue rectangles in the V-Maps. The slanted line in the V-Maps represents the ground plane. 
examples show that the disparity maps produced by ReS²tAC are well suited for a reactive
obstacle detection based on the computation of U- and V-Maps, as the obstacles are clearly 
marked by the red ellipses in the U-Maps and the blue rectangles in the V-Maps. In both 
examples, the ground plane is clearly revealed by the slanted line inside the V-Map. In the 
bottom example, the V-Map clearly shows that it is also an option to fly over the detected 
obstacles as the vertical line only goes up to half the height of the disparity map. As dis- 
cussed by Ruf et al. (2018a), the integrity of the algorithm to evade the obstacles and avoid 
collisions was validated by means of hardware-in-the-loop testing in combination with the 
flight simulators of DJI and Microsoft AirSim (Shah et al. 2017). In this, the flight controller 
of the UAV is connected via USB to the PC which is running the simulator allowing the flight 
controller to control the simulation instead of controlling the aircraft itself. 


As previously discussed, safety-critical applications, such as obstacle detection and collision 
avoidance, typically have strict run-time requirements, in the sense that they have to run 
in real-time and with low latency in order to allow a quick reaction. Figure 3.15 illustrates
that ReS²tAC is well capable of meeting these requirements. It achieves frame rates of over
30 FPS on the embedded GPU of the NVIDIA Jetson AGX when processing stereo images
of a size of 640 × 480 pixels and 64 disparities, which is also the case for the examples depicted
in Figure 6.2. So far, the considered approach for obstacle detection and collision
avoidance has not been optimized for parallel execution. Instead, it is executed on the CPU
of the NVIDIA Jetson AGX, increasing the run-time of the full pipeline by approximately
10 %, which results in a loss of 2-3 FPS. This is acceptable, especially since large parts of the
construction of the U- and V-Maps, such as the computation of the disparity histograms and 
the subsequent filtering, are well-suited for massively parallel execution by means of GPGPU 
and can thus be further optimized. The run-times listed in Table 5.4 corresponding to the 
execution of the third approach that is proposed in this work for the task of depth estimation, 
namely the approach for single-view depth estimation, show that it achieves similar frame 
rates as ReS²tAC. However, as already noted in the corresponding discussion, approaches
based on deep learning are of limited suitability for safety-critical applications, due to their
possibly unreliable handling of unknown data. Nonetheless, the research field of explainable
AI has grown significantly in recent years (Barredo Arrieta et al. 2020), further increasing
the reliability of deep-learning-based approaches in safety-critical environments.
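
As a rough plausibility check of the frame-rate loss reported above, the following worked example assumes a baseline of exactly 30 FPS (the text only states "over 30 FPS") and a run-time increase of exactly 10 %. Both values are simplifying assumptions made for illustration.

```latex
\[
t_{\text{stereo}} = \frac{1}{30\,\text{FPS}} \approx 33.3\,\text{ms}, \qquad
t_{\text{full}} = 1.1 \cdot t_{\text{stereo}} \approx 36.7\,\text{ms}, \qquad
\frac{1}{t_{\text{full}}} \approx 27.3\,\text{FPS}
\]
```

Under these assumptions, the full pipeline loses about 2.7 FPS, which is consistent with the stated range of 2-3 FPS.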


6.2 Rapid 3D Mapping Based on Dense Depth Maps 


While the first part of the proposed embedding framework (Section 1.2) focuses on the
actual operation of the UAV, the second part aims to support the operators, which in the
considered use-case are emergency forces, in the rapid assessment of a geographical area
by means of 3D mapping. Given the image data from the payload camera, a fast and online 


Figure 6.3: Overview of a generic pipeline for online 3D reconstruction and model generation from data of a monoc- 
ular image sequence. The first stage performs an estimation of the camera trajectory together with a 
sparse mapping of the scene by using a visual SLAM (V-SLAM) algorithm. In this, it is assumed that 
the intrinsic matrix of the camera that is capturing the image sequence is known. In the second stage, 
dense depth maps are computed by means of monocular multi-view stereo from a bundle of input im- 
ages with corresponding camera poses. The estimated depth maps are then incrementally fused into a 
three-dimensional model in the third stage. 


3D mapping is to be performed, meaning that the 3D model is created while the image data is 
acquired or at least received by the ground control station (GCS). If an appropriate data-link 
between the UAV and the GCS is established, the processing can be done while the UAV is 
flying over the area of interest. Since the use-case aims at fast and online processing in order 
to aid first responders, software toolboxes, such as COLMAP (Schönberger and Frahm 2016, 
Schönberger et al. 2016), that are aimed at offline 3D reconstruction are not an option for
this use-case, due to their higher run-time. Nonetheless, highly accurate 3D models produced 
by such toolboxes are still helpful for an offline assessment and their generation could be 
triggered anyway, once all image data is acquired. 


Conventional approaches for fast and online 3D reconstruction and model generation based 
on monocular image data are typically divided into three subsequent steps, namely extrin- 
sic camera pose estimation, dense depth estimation and depth map fusion (cf. Figure 6.3). 
Given data from an image sequence captured by a single moving camera which has been 
calibrated a priori, meaning that the intrinsic camera matrix K is known, the first step per- 
forms an estimation of the camera trajectory together with a sparse mapping of the scene 
by using a visual SLAM (V-SLAM) algorithm. In the second stage, dense depth maps are 
computed by means of monocular multi-view stereo from a bundle of input images with cor- 
responding camera poses. The estimated depth maps are then incrementally fused into a 
three-dimensional model in the third stage. 


With FaSS-MVS (Chapter 4), this work presents an approach for an efficient processing of 
the computationally most expensive part of the pipeline, i.e. the dense depth estimation. To 
demonstrate that FaSS-MVS is capable of facilitating rapid 3D mapping, an approach based 
on the processing pipeline illustrated by Figure 6.3 was implemented and evaluated. In this, 
the state-of-the-art and open-source SLAM system ORB-SLAM2 (Mur-Artal and Tardós 2017)
is used for sparse mapping and camera trajectory estimation in the first stage. While ORB- 
SLAM2 continuously tracks the movement of the camera for each frame of the image se- 
quence, it also defines a subset of these input images as so-called key-frames. These key- 
frames are used to create a sparse map of the environment by matching and triangulating the 
ORB features (Rublee et al. 2011) between the key-frames. In doing so, the extrinsic camera
poses of the key-frames are simultaneously estimated and updated with every new key-frame
that is added to the list. For the approach of fast 3D mapping, the most recent group of five
key-frames, consisting of the input images together with the corresponding camera poses, is
repeatedly selected and passed on to the second stage. By relying on the internal logic of ORB-SLAM2
to select the key-frames, it is ensured that only images with sufficiently novel content
are used for further processing. In the second stage, FaSS-MVS is used to estimate a dense
depth map for the middle image of each input bundle, and the resulting depth maps can then
be fused into a single 3D model in the third stage. For the depth map fusion, the open-source
system ElasticFusion, presented by Whelan et al. (2015) and Whelan et al. (2016), is used. It 
relies on both geometric and photometric cues to register and fuse the input data into the 
already established model and thus incrementally constructs a 3D model of the covered area. 
While it uses a point-to-plane error measure as a geometric cue, it constructs a photometric 
cost by projecting the model into the view of the new input frame and matches it with the 
corresponding input image. Just as FaSS-MVS, ElasticFusion is optimized and accelerated for 
real-time processing by utilizing general-purpose computation on a GPU (GPGPU). Available 
geographical information provided by the on-board GNSS of the UAV can either be used to 
geographically reference the input images and thus propagate the geographical information 
through the pipeline, or it can be used to calculate an appropriate transformation which is applied
to the resulting model in a post-processing step. Either way, the geographical referencing
of the processed data is vital for the application in focus.
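
The hand-over between the first and second stage, i.e. the repeated selection of the five most recent key-frames with the middle one serving as reference, can be sketched as follows. The class `KeyFrameBundler` and its interface are hypothetical and do not correspond to the API of ORB-SLAM2 or FaSS-MVS. The sketch further assumes that the bundle slides forward by one key-frame at a time, which the text does not specify.

```python
from collections import deque


class KeyFrameBundler:
    """Collects key-frames and emits bundles for dense depth estimation."""

    def __init__(self, bundle_size=5):
        self.bundle_size = bundle_size
        # The deque drops the oldest key-frame automatically once it is full.
        self._window = deque(maxlen=bundle_size)

    def add_key_frame(self, image_id, pose):
        """Add a key-frame; return (reference, bundle) once enough frames exist."""
        self._window.append((image_id, pose))
        if len(self._window) < self.bundle_size:
            return None
        bundle = list(self._window)
        # The depth map is estimated for the middle image of the bundle.
        reference = bundle[self.bundle_size // 2]
        return reference, bundle


bundler = KeyFrameBundler()
results = [bundler.add_key_frame(i, "pose%d" % i) for i in range(7)]
```

With seven key-frames, the first bundle becomes available with the fifth key-frame and uses key-frame 2 as reference. Each further key-frame shifts the reference by one.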


This approach for fast dense 3D mapping from monocular image data captured by COTS 
UAVs was thoroughly evaluated and published in the work of Hermann et al. (2021b). In the 
following, the findings presented by Hermann et al. (2021b) will be discussed with respect 
to the considered use-case of facilitating first responders by providing rapid 3D mapping of 
the operational area. The evaluation was conducted on a synthetic dataset extracted from the 
video game Grand Theft Auto V (GTA V) (cf. Section 5.6.1) as well as on the TMB dataset 
(cf. Section 4.4.1) consisting of real-world data captured by a COTS UAV. The accuracy of the 
resulting 3D model is calculated by computing the root-mean-square error (RMSE) between 
the points of the estimated 3D model and the ground truth, with both entities being aligned 
in such a way that the distance between each point in the estimated 3D model and the closest 


Table 6.1: Quantitative results of the 3D models computed by the discussed pipeline for fast 3D mapping. The results 
are computed by evaluation on a synthetic dataset (GTA) and a real-world dataset (TMB). (Hermann et al. 
2021b, Tab. 4.) 


                 GTA              TMB
RMSE (in m)      0.774 ± 0.616    0.141 ± 0.033
counterpart in the ground truth is minimal. The qualitative results depicted in Figure 6.4 suggest
a high quality of the resulting point clouds. This is supported by the quantitative results
listed in Table 6.1, with an RMSE of less than 1 m on both datasets. The RMSE on the GTA dataset
is much higher than that achieved on the TMB dataset. However, the fairly high standard
deviation of 0.616 m suggests that there are great quality differences between the individual
sequences. The qualitative excerpt in Figure 6.4 shows that even though the considered approach
has difficulties in reconstructing fine-grained structures and a high level-of-detail, it is
well capable of reconstructing larger structures and the general geometry of the scene.
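
The error measure used above, i.e. the RMSE over the distances between each estimated point and its closest ground-truth point, can be illustrated with the following minimal sketch. It assumes that both point sets are already aligned and uses a brute-force nearest-neighbor search for clarity. The function name `nearest_neighbor_rmse` and the toy points are assumptions, and the evaluation of Hermann et al. (2021b) may differ in its details.

```python
import math


def nearest_neighbor_rmse(estimated, reference):
    """RMSE of the distances from each estimated point to its closest reference point."""
    squared_distances = []
    for p in estimated:
        # Squared Euclidean distance to the closest reference point (brute force).
        d2 = min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in reference)
        squared_distances.append(d2)
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Toy example: two estimated points, each 0.1 m off their ground-truth counterpart.
estimated_points = [(0.0, 0.0, 0.1), (1.0, 0.0, -0.1)]
reference_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rmse = nearest_neighbor_rmse(estimated_points, reference_points)
```

For point clouds of realistic size, the quadratic search would typically be replaced by a spatial index such as a k-d tree.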


However, the evaluation also reveals that the approach is not capable of successfully pro- 
cessing all input sequences but instead failed on 7.85% of the synthetic image sequences. 
The majority of these failures arise from errors in the registration and fusion of depth maps 


Figure 6.4: Exemplary results of the 3D models computed by the discussed pipeline for online 3D reconstruction. 
Row 1: Exemplary input images from the datasets. Rows 2 & 3: Images of the colored point clouds 
computed from the image sequences corresponding to the images in Row 1. Row 4: Visual comparison 
between the estimated point cloud and the reference model. The reference model is depicted in blue, 
while the estimated point cloud is depicted in yellow. In some areas, the estimated data lies beneath the 
reference data and is thus occluded. (Hermann et al. 2021b, Fig. 3.) 
into a single model when covering challenging scenes. Single and isolated misalignments are
compensated by the outlier filtering of ElasticFusion, as there will still be enough correctly 
registered estimates that reduce the confidence score of the outliers. If a number of depth 
maps are repeatedly and incorrectly registered, however, the resulting model becomes very 
noisy, making it difficult to fuse subsequent depth maps and, in turn, resulting in a failure of 
reconstruction. The other cause of failure lies in the camera movement and the flight trajectory
of the UAV. If the camera is moved too rapidly, with a high amount of rotation, the feature
tracking within the SLAM algorithm fails, requiring a re-initialization of the trajectory esti- 
mation. While this re-initialization typically does not pose a major problem, it still leads to a 
stop in the estimation of the depth maps for a certain period of time, making the registration of 
the new depth maps difficult due to a smaller or missing overlap. This, however, can be compensated
if the pose estimation is regularized by augmenting the input images with
data from the GNSS and IMU, substituting the visual SLAM system with a visual-inertial SLAM
(VI-SLAM) system, such as ORB-SLAM3 (Campos et al. 2021). In addition, if the UAV is not
operated manually but instead automatically follows a predefined and specifically planned 
flight path, rapid camera movements are typically avoided as the flight controller of modern 
COTS UAVs smoothly updates the camera position based on the predefined configuration. 


Despite these weaknesses and instabilities, the biggest strength of the considered approach 
lies in its run-time. While ORB-SLAM2 and ElasticFusion are capable of running in real-time, 
with ORB-SLAM2 running on a commodity CPU and ElasticFusion on a GPU, FaSS-MVS by 
itself is not capable of real-time processing, as already discussed in Section 4.5.3. But since
FaSS-MVS is not executed for every frame of the input stream, the whole pipeline is still
capable of keeping up with an input video with 30 FPS, making it capable of online processing.
Furthermore, as previously discussed in Section 4.5.3, the flight speed and configuration
have a great impact on the processing speed. The experiments conducted by Hermann et
al. (2021b) show that in case of the GTA dataset, the average subsampling ratio of the considered
approach is around 2%, meaning that for only 2% of the input images depth maps
were computed. 
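
To make the effect of the subsampling ratio more tangible, the following sketch derives the average time that is available per depth map. It assumes that depth maps are computed sequentially and that the subsampling is evenly distributed over time, neither of which is stated in the source. The function name `depth_map_time_budget` is hypothetical.

```python
def depth_map_time_budget(video_fps, subsampling_ratio):
    """Average time in seconds available for computing one depth map."""
    depth_maps_per_second = video_fps * subsampling_ratio
    return 1.0 / depth_maps_per_second


# 30 FPS input video, depth maps computed for roughly 2 % of the frames.
budget_seconds = depth_map_time_budget(video_fps=30.0, subsampling_ratio=0.02)
```

Under these assumptions, a depth map is requested roughly every 1.7 s on average, which leaves considerably more time per depth map than the 33 ms frame interval of the input video.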


In summary, it has been shown that with a processing pipeline as presented in Figure 6.3, 
the approach for Fast Multi-View Stereo with Surface-Aware Semi-Global Matching (FaSS-MVS)
presented in this work is well suited for rapid 3D mapping of a geographical area
by aerial reconnaissance with COTS UAVs. In doing so, the generated 3D map can facilitate 
emergency forces in a quick assessment of the situation, either by means of providing a large- 
scale 3D model or by providing an overview with the help of up-to-date orthographic maps, 
as illustrated in Row 2 of Figure 6.4. 


What still needs to be discussed, however, is whether the presented approach for self- 
supervised single-view depth estimation (SSDE) from aerial imagery can be used within the 
dense depth estimation stage of the considered pipeline for rapid 3D mapping. The greatest
advantages of SSDE over conventional monocular MVS approaches are its high processing
speed, the ability to predict depth maps from images containing dynamic objects, and the
ability to predict depth from images with a very large focal length. However, its lack of a
metric scale, temporal inconsistency, and its unreliable handling of unknown objects or scenes
as well as the small image size of the resulting depth maps make the presented SSDE approach
unfit to be used by itself within the considered pipeline for rapid 3D mapping. The different
depth scales and the temporal inconsistency of the depth maps as well as possibly wrong
predictions due to a lack of generalization are likely to induce an error in the registration of
newly estimated data into the 3D model which, in turn, will lead to a failure in the process
of depth map fusion. The approach of SSDE is rather suited for a complementary use in
combination with a monocular MVS approach, such as FaSS-MVS. Due to its high processing
speed, SSDE can easily be executed in parallel to FaSS-MVS. Given the depth maps from
both approaches, depicting the same scene from the same viewpoint, the multi-view depth
map, on the one hand, can then be used to adjust the scale of the single-view depth map. The
single-view depth map, in turn, allows image regions in the multi-view depth map
that are missing due to occlusions or due to the existence of dynamic objects to be filled in.
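
One conceivable way to realize this complementary use is sketched below. The source does not specify how the scale is adjusted. The sketch therefore assumes a single global scale factor, estimated as the median ratio between the two depth maps over all pixels that are valid in both. The function name `fuse_depth_maps` and the use of `None` for missing multi-view estimates are likewise assumptions.

```python
from statistics import median


def fuse_depth_maps(mvs_depth, ssde_depth):
    """Fill holes in a multi-view depth map with rescaled single-view depth.

    mvs_depth: rows of metric depths, with None marking missing estimates.
    ssde_depth: rows of single-view depths of the same size, up to scale.
    Returns the fused depth map and the estimated scale factor.
    """
    ratios = [
        m / s
        for row_m, row_s in zip(mvs_depth, ssde_depth)
        for m, s in zip(row_m, row_s)
        if m is not None and s > 0
    ]
    if not ratios:
        raise ValueError("no overlapping valid pixels to estimate the scale")
    # The median is robust against individual outliers in either depth map.
    scale = median(ratios)
    fused = [
        [m if m is not None else scale * s for m, s in zip(row_m, row_s)]
        for row_m, row_s in zip(mvs_depth, ssde_depth)
    ]
    return fused, scale


mvs = [[10.0, 20.0], [None, 40.0]]
ssde = [[0.5, 1.0], [1.5, 2.0]]
fused, scale = fuse_depth_maps(mvs, ssde)
```

In the toy example, the single-view depth is scaled by a factor of 20 and the missing pixel is filled with 30.0, while all valid multi-view estimates are kept unchanged.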


Lastly, for the task of 3D mapping, image-based techniques can evidently only be used with 
appropriate visibility and lighting conditions. Thus, the complementary use of active sensing,
such as LiDAR, is to be considered in future work, especially since technological
advancements allow more and more COTS UAVs to be equipped with such sensors.


6.3 Change Detection in 3D Model Data 


The third application considered in the proposed framework aims at supporting the emer- 
gency forces by means of 3D change detection, which can be of great importance in facilitat- 
ing damage assessment or an updated route and mission planning for the emergency forces. 
According to the taxonomy introduced by Qin et al. (2016), 3D change detection method- 
ologies can be divided into two categories, namely “difference-based” and “projection-based” 
methodologies. As the name suggests, methodologies in the first category use a simple difference
metric between two 3D models to detect changes. In this, 2.5D model data, such as
depth maps or digital elevation models, as well as complete 3D models can be used. Change 
detection based on 2.5D data is performed by simply calculating the point-wise difference 
between the two models along the depth or elevation direction. While this is a very simple 
methodology that scales well to large-scale models, it is, however, very sensitive to errors in 
the co-registration of the two models. If the model data is given in the form of complete 3D 
models, meaning that the models also contain convex structures, change detection based on a 
simple depth or height difference is not appropriate, since this would not consider structures 
that are occluded by overhanging structures. To circumvent this, Qin et al. (2016) propose 
Figure 6.5: Overview of the processing pipeline for rapid 3D change detection based on depth difference. Up-to-date
image data is used to compute depth maps (D_I) of a scene for which the changes are to be determined.
These are compared against reference depth maps (D_S) which are synthetically rendered from a reference
model with respect to which the 3D changes are to be detected. For this, it is important that the image 
data is localized and co-registered with respect to the 3D model data. 


change detection based on the minimum Euclidean distance between the two models. This 
approach is more robust to small errors in the co-registration, but at the same time requires 
the finding of point correspondences between the models. 


On the other hand, methods that fall into the second category of the taxonomy introduced 
by Qin et al. (2016) use data from an existing 3D model, against which the change is to be 
detected, to project one or more neighboring views into a reference view and match the 
projected images to the reference image in order to determine inconsistencies. In this, it is 
assumed that, if a change has occurred with respect to the reference model, the corresponding
image regions between the projected and the reference view do not coincide anymore,
introducing inconsistencies which, in turn, indicate a change in the scene structure. The 
advantage with projection-based methodologies is that, in order to perform change detec- 
tion, no explicit depth estimation or 3D reconstruction is required. According to Qin et al. 
(2016), these approaches are particularly efficient when the existing 3D model as well as 
the co-registration between the image data and the model are highly accurate. At the same 
time, however, such approaches have difficulties when the 3D model has a significantly lower 
level-of-detail than the image data or if the changes occur in homogeneous image regions. 


For the 3D change detection from aerial imagery, this work considers a processing pipeline 
which is illustrated by Figure 6.5 and was published, together with preliminary results, by 
Ruf and Schuchert (2016). Again, a very strong focus is put on the run-time of the considered 
approach in order to achieve online processing. Therefore, the change detection is performed only
by comparison of two depth maps and thus falls into the first category of the taxonomy introduced
by Qin et al. (2016). Here, one depth map, i.e. D_I, is computed from the image data
captured by a UAV flying over an area of interest, while the other one, i.e. D_S, is synthetically
rendered from the reference 3D model against which the change is to be determined. Even
though, in the scope of the proposed framework (Section 1.2), it is assumed that D_I is computed
either by FaSS-MVS (Chapter 4) or the approach for SSDE (Chapter 5), it can also be
computed by an approach, such as ReS²tAC (Chapter 3), that relies on two-view stereo, as the
preliminary results in Ruf and Schuchert (2016) demonstrate. The rendering of the synthetic
depth map D_S can efficiently be implemented by relying on the real-time rendering pipeline
of OpenGL. By considering the depth information alone, the methodology does not rely on
the comparison of texture information between the reference model and the image data. This
even allows change detection to be performed with respect to data which has been acquired a long
time ago and possibly at a different time of the year, or which originates from different sensor
modalities. However, this approach requires an accurate localization of the image data with
respect to the reference 3D model. Even though the pipeline presented in Figure 6.5 suggests
that the co-registration between the image and model data is to be performed before
the generation of D_I, it can also be executed after the image-based depth estimation, which,
in turn, allows D_I to also be considered in the registration process. Experiments with respect to
both options for the task of automatic co-registration of aerial image data with respect to
a 3D model were conducted and published by Schmitz et al. (2019). In both cases, the co-registration
is required in order to render the 3D model in such a way that D_S depicts the
same perspective as D_I.


In the process of change detection, a simple difference metric is considered, namely the pixel- 
wise absolute difference between the logarithmic, normalized depth according to: 


\[
M(p) =
\begin{cases}
1, & \text{if } \left| \log(D_S(p)) - \log(D_I(p)) \right| > \tau \\
0, & \text{otherwise.}
\end{cases}
\tag{6.1}
\]


Here, M denotes a binary change mask, marking changed image regions with 1, while D_S and
D_I represent the normalized depth maps. The threshold, for which the difference between
the two depth maps is considered as change, is defined by τ. With the use of the logarithmic
depth, variations in regions with a large depth have less influence on the change detection. 
This is necessary, as the image-based depth estimation typically becomes less accurate with 
increasing distance. In order to account for noise, the change mask M is additionally pro- 
cessed by an appearance-based Gaussian smoothing, analogous to that introduced by Equa- 
tion (4.11). In order to additionally store information about whether a structure has been 
added or removed with respect to the existing model, Equation (6.1) is extended to: 


\[
M(p) =
\begin{cases}
1, & \text{if } \min\{\log(D_I(p)) - \log(D_S(p)),\, 0\} < -\tau \\
-1, & \text{if } \min\{\log(D_S(p)) - \log(D_I(p)),\, 0\} < -\tau \\
0, & \text{otherwise.}
\end{cases}
\tag{6.2}
\]




(a) Reference Scene (b) Matching Scene 


Figure 6.6: Illustration of the reference and matching scene used for the qualitative evaluation of the considered 
approach for rapid 3D change detection. Here, images of an industrial hall were captured on two dif- 
ferent days. The visual appearance between the scenes greatly differs, which is due to different weather 
conditions. Between the two scenes, the cars were differently positioned and the blue lifting platform, 
visible in the reference scene, was removed. The co-registration of the data was done manually.


By not considering the absolute difference, it is now possible to determine whether a posi- 
tive or negative change has occurred. Since the logarithm is negative for the range [0,1], a 
negative threshold must be used. 
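
The binary and the signed change mask can be sketched as follows. The sketch follows one reading of Equations (6.1) and (6.2), in which a pixel is marked with 1 if the image-based depth is clearly smaller than the reference depth (structure added) and with -1 in the opposite case (structure removed). It assumes strictly positive, normalized depth values and a positive threshold `tau`. The function names are hypothetical, and the appearance-based smoothing of the mask is omitted.

```python
import math


def signed_change_mask(depth_image, depth_reference, tau):
    """Per-pixel change mask: 1 = structure added, -1 = removed, 0 = unchanged."""
    mask = []
    for row_i, row_s in zip(depth_image, depth_reference):
        mask_row = []
        for d_i, d_s in zip(row_i, row_s):
            diff = math.log(d_i) - math.log(d_s)
            if min(diff, 0.0) < -tau:
                # Image depth is clearly smaller: something new is in front.
                mask_row.append(1)
            elif min(-diff, 0.0) < -tau:
                # Reference depth is clearly smaller: a structure is gone.
                mask_row.append(-1)
            else:
                mask_row.append(0)
        mask.append(mask_row)
    return mask


def binary_change_mask(depth_image, depth_reference, tau):
    """Binary variant that only marks whether the log-depth difference exceeds tau."""
    signed = signed_change_mask(depth_image, depth_reference, tau)
    return [[abs(v) for v in row] for row in signed]


image_depth = [[0.2, 0.5, 0.8]]
reference_depth = [[0.5, 0.5, 0.4]]
signed = signed_change_mask(image_depth, reference_depth, tau=0.3)
binary = binary_change_mask(image_depth, reference_depth, tau=0.3)
```

Since the computation is purely pixel-wise, it adds only little run-time on top of the depth estimation and can be parallelized trivially.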


For a qualitative evaluation and test of the considered approach for change detection from 
aerial imagery, images of an industrial hall were captured by a DJI Phantom 3 Advanced on 
two different days, as illustrated in Figure 6.6. This figure clearly reveals that the visual ap- 
pearance between the scenes greatly differs, which is due to different weather conditions. The 
matching scene is significantly brighter and more colorful than the reference scene. Between 
the two scenes, the cars were differently positioned and the blue lifting platform, visible in
the reference scene, was removed. The co-registration of the data was done manually using
the CloudCompare* software.


In Figure 6.7, qualitative results of the considered approach for 3D change detection are illustrated.
Here, the image-based depth maps D_I were generated based on images from the
input sequence. For the reference depth maps D_S, a reference model was first calculated using
COLMAP (Schönberger and Frahm 2016, Schönberger et al. 2016) from which the depth
maps were synthetically rendered. Using Equation (6.2), change
masks were computed based on D_I and D_S. In the illustration of the depth maps in Figure 6.7,
positive changes (structures that are not present in the reference model) are marked
in blue and negative changes (structures that have been removed) are marked in red. A
qualitative evaluation reveals that the two vehicles in the background are correctly identified 
and marked as positive changes. In case of negative changes, the change masks show many 


* http://www.cloudcompare.org/ 
false positives. In addition, the real changes, such as the removed car or the removed lift, have
not been detected. This can be attributed to a possibly different scaling between reference
and matching depth maps, which can be deduced from the different coloring between D_I and
D_S. As already discussed by Qin et al. (2016), this form of 3D change detection, i.e. based
on the point-wise difference along the depth direction, is particularly prone to errors in the 
co-registration. However, a major advantage of the methodology is that it does not rely on a 
texture comparison between the reference and matching data. Thus, different weather con- 
ditions, times of day, or seasons typically do not have an influence on the change detection. 
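
To make the depth-difference principle concrete, the following minimal Python sketch classifies each pixel by thresholding the signed difference between the reference depth and the estimated depth. It is an illustration under stated assumptions and not the formulation of Equation (6.2): the function name `change_mask`, the single `threshold` parameter, the convention that non-positive depths mark invalid pixels, and the use of plain nested lists are all hypothetical choices made here. It further assumes that a structure missing from the reference model appears closer to the camera than the reference depth (positive change), whereas a removed structure appears farther away (negative change).

```python
def change_mask(depth_est, depth_ref, threshold):
    """Per-pixel change labels from two depth maps of equal size.

    Returns +1 for a positive change (estimated surface closer than the
    reference by more than `threshold`), -1 for a negative change
    (estimated surface farther than the reference by more than
    `threshold`), and 0 for no change or an invalid pixel.
    Assumption: depth values <= 0 mark invalid pixels.
    """
    mask = []
    for row_est, row_ref in zip(depth_est, depth_ref):
        out_row = []
        for est, ref in zip(row_est, row_ref):
            if est <= 0 or ref <= 0:
                out_row.append(0)       # no decision without valid depth
            elif ref - est > threshold:
                out_row.append(1)       # something new in front of the model
            elif est - ref > threshold:
                out_row.append(-1)      # something from the model is gone
            else:
                out_row.append(0)
        mask.append(out_row)
    return mask


if __name__ == "__main__":
    reference = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
    estimated = [[10.0, 7.0, 10.0], [13.0, 0.0, 10.0]]
    print(change_mask(estimated, reference, threshold=1.0))
```

Because the decision depends only on a per-pixel difference along the depth direction, a global scale mismatch or a co-registration error shifts many differences at once, which is consistent with the sensitivity discussed above.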


The run-time of the actual change detection is negligible compared to the image-based depth 
estimation based on monocular MVS. When using approaches that achieve higher processing 
rates, such as ReS²tAC or SSDE, the change detection will increase the latency of the complete
processing pipeline. However, since it is only a computation based on depth difference, it is 
not very computationally expensive. As already discussed in the previous chapter on rapid 
3D mapping, the lack of scale and the difficulty in generalization of an approach based on 
SSDE inhibit its sole use within the presented pipeline. Nonetheless, a complementary use in 
order to enhance the depth maps computed by conventional methods can be very beneficial. 


Lastly, just as in the case of the discussion on obstacle detection and avoidance, this is a
very simple approach for 3D change detection that still has a number of weaknesses,
as illustrated in Figure 6.7. It should be improved further, for example by using a more
sophisticated difference measure, by introducing temporal consistency in order to reduce
false positives, or by enhancing it with data-driven methods relying on CNNs. Nonetheless,
the results demonstrate the feasibility of using approaches for fast image-based depth
estimation in order to facilitate rapid 3D change detection, which, in
turn, aids emergency forces in the assessment of the situation in case of disaster relief.
This work addresses the use of commercial off-the-shelf (COTS) UAVs by emergency forces, in
particular first responders, to facilitate disaster relief or search-and-rescue (SAR) missions by 
aerial reconnaissance. It proposes a framework that aims at facilitating the rapid assessment 
of a disaster site based on aerial image data captured by UAVs, which are ideally operated 
fully autonomously in conjunction with a ground control station (GCS). This framework is 
divided into two parts: (i) The first part aims at increasing the safe and autonomous use of 
the UAV by implementing an approach for real-time and low-latency two-view stereo depth 
estimation that is executed on a high-performance on-board processing unit (PU). It esti- 
mates disparity maps from the on-board stereo vision sensor, which serve as an input to an 
algorithm for reactive obstacle detection and collision avoidance. (ii) The second part of the 
proposed framework addresses the use of image data from the visual payload camera to per- 
form fast multi-view stereo (MVS) depth estimation and single-view depth estimation (SDE) 
to allow rapid 3D mapping and 3D change detection. Here, it is assumed that the image data 
is streamed down to the GCS, equipped with more powerful hardware in order to allow for 
high resolution online processing. The approaches for two-view, multi-view and single-view 
depth estimation are individually and thoroughly evaluated on appropriate datasets, clearly 
demonstrating their performance. The feasibility of their use for the corresponding applications
within the proposed framework is shown and discussed by embedding them into larger
processing pipelines and performing experiments on use-case-specific datasets.


With the three different approaches for fast depth estimation, this work presents a number 
of scientific contributions. The key contributions of this work are summarized as follows: 


• Implementation-specific optimization of the SGM algorithm for real-time two-
view stereo on embedded ARM and CUDA devices: The so-called Semi-Global 
Matching (SGM) algorithm is one of the most widely used algorithms for embedded 
and real-time dense depth estimation based on two-view stereo images. This is due 
to its good trade-off between accuracy and computational complexity. At the same 
time, when it comes to embedded high-performance computing, the use of FPGA- 
based systems, for which the development is often cumbersome and costly, was for 
a long time the only option. In recent years, however, more and more COTS UAVs 
can be equipped with SOCs that provide embedded GPUs, such as the NVIDIA Jetson 
series, which, in turn, can be utilized through the CUDA-API for massively parallel processing.
With ReS²tAC, this work proposes an approach in which the SGM algorithm has
been optimized for embedded CUDA GPUs, such as the NVIDIA Tegra, by utilizing the
CUDA-API, as well as for embedded ARM CPUs by utilizing the NEON intrinsics for
vectorized SIMD processing. The optimization of the SGM algorithm with NEON intrinsics
for SIMD processing on embedded ARM CPUs is the first of its kind. It is shown
that ReS²tAC achieves state-of-the-art accuracies on public stereo benchmarks, while
the CUDA-based implementation reaches real-time performance on modern embedded 
SOCs. Furthermore, it is demonstrated that it can be used for real-time stereo depth 
estimation on COTS UAVs that are equipped with appropriate hardware. Even though 
GPU-based SOCs have a worse FPS/W ratio than those that rely on FPGA technology, 
it is shown that in case of rotor-based UAVs, their power consumption is negligible 
compared to that of the rotors during flight. 


• A hierarchical approach for fast dense depth estimation based on monocular
multi-view stereo with a surface-aware improvement of the SGM algorithm: 
For high resolution and fast dense depth estimation that facilitates rapid 3D mapping 
and 3D change detection, this work proposes the so-called Fast Multi-View Stereo with 
Surface-Aware Semi-Global Matching (FaSS-MVS) approach. It relies on a hierarchi- 
cal processing to efficiently perform dense depth estimation by means of monocular 
MVS from aerial imagery. In the depth estimation, a plane-sweep algorithm for dense 
multi-image matching and generation of depth hypotheses is used, which is followed 
by an adapted and improved variant of the SGM algorithm to perform a surface-aware 
regularization and depth map computation. The surface-aware regularization reduces 
the fronto-parallel bias of the SGM algorithm and makes it possible to account for slanted
surfaces, which are particularly prominent in oblique aerial imagery. The quantitative and
qualitative evaluations show that the depth maps of FaSS-MVS achieve a high quality
with an L1 error of only 1% of the maximum depth. As the name suggests, the focus
of FaSS-MVS lies in a fast and yet accurate computation that allows for online processing,
meaning that it should be fast enough to keep up with the frame rate of the
input data, while not being subject to hard run-time or latency constraints, as it is
not intended for safety-critical applications. Apart from the methodical optimization,
FaSS-MVS has also been optimized for massively parallel processing on a GPU, which allows
it to reach a processing rate of 1-2 Hz, depending on the configuration of the input
data. This is enough to keep up with the input data stream, since in case of monocular
MVS, the depth estimation is not performed for each input image. 


• Investigation of the use of self-supervised learning for the task of single-view
depth estimation from aerial imagery: A major disadvantage of the approaches
presented with ReS²tAC and FaSS-MVS, which rely on dense image matching (DIM), is
that they require two or more images for the task of depth estimation. This, however,
limits their processing rate as well as, in the case of monocular MVS, their ability to
reconstruct dynamic objects. Furthermore, if depth estimation is performed based on images
of a stereo camera, the depth range is limited by the baseline between the cameras.
Motivated by the ability of humans to reason about depth from a single input image 
based on their empirical knowledge, a data-driven approach for single-view depth estimation
(SDE) based on CNNs is investigated. In particular, the considered approach is trained
by means of self-supervision so that it does not depend on specific training data, which is
typically cumbersome and costly to acquire. By formulating the training task as a novel
view synthesis and image reconstruction problem, the approach can be trained on any
image data from a video sequence that depicts a static scene and is captured by a single 
moving camera. The independence of specific training data, i.e. reference depth maps, 
makes the deployment of a self-supervised single-view depth estimation approach more 
flexible with respect to the practical use. The experimental results suggest that the approach
can possibly be used to facilitate a number of applications, such as real-time
3D mapping, autonomous navigation or scene understanding. However, the lack of
a metric scale and the temporal inconsistency in the resulting depth maps, as well as the
unreliable handling of unknown objects or scenes by learning-based approaches, make
the presented SSDE approach rather unfit to be used by itself. Instead, it is suggested
to use the approach for single-view depth estimation complementary to an approach
based on DIM, such as ReS²tAC or FaSS-MVS. Due to its high processing speed, SSDE
can easily be executed in parallel. Given the depth maps from both approaches depicting
the same scene from the same viewpoint, the DIM depth map can be used to adjust the
scale of the single-view depth map, which, in turn, can then be used to
fill image regions that are missing in the DIM depth map due to occlusions or due to
dynamic objects.


• Demonstration of the use of dense depth maps estimated by means of two-
view stereo for the task of on-board obstacle detection and collision avoidance
on COTS UAVs: To demonstrate the suitability of ReS²tAC for real-time and low-latency
obstacle detection and collision avoidance, it is combined with an approach
that transforms the disparity or depth maps into a top-down and a side view of the area
in front of the UAV, represented by the so-called U- and V-Maps. From these U- and
V-Maps, cylindrical models of the obstacles can be extracted by subsequent filtering,
contour detection, and fitting of ellipses and rectangles. Qualitative experiments show
that obstacle detection and collision avoidance based on the U- and V-Map is simplistic
and yet very efficient in the detection of obstacles. Its low computational complexity
allows, in conjunction with ReS²tAC, for real-time and low-latency processing on board
the UAV. Yet, its lack of temporal consistency makes the approach sensitive to false-positive
detections and possible noise inside the disparity or depth maps.
• Demonstration of the use of dense depth maps estimated by means of multi-
view stereo for the task of rapid 3D mapping and 3D change detection: The 
second part of the proposed framework aims at rapid 3D mapping and 3D change de- 
tection based on dense depth maps computed by FaSS-MVS. The separate evaluation 
of FaSS-MVS has shown that the estimated depth maps have a high quality and that it 
can reach a processing rate of 1-2 Hz, depending on the configuration of the input data. 
To demonstrate the use of FaSS-MVS for rapid 3D mapping from aerial imagery, it is
integrated into a larger processing pipeline that relies on an open-source real-time visual
SLAM system for the estimation of the camera trajectory as well as on an open-source
system for real-time depth map fusion and 3D reconstruction. A thorough evaluation,
which is published in other work, shows that for a great majority of the investigated
image sequences, this pipeline produces accurate 3D models with a mean RMSE of less
than 1 m and that it is capable of running concurrently with the input stream and keeping
up with an input frame rate of 30 FPS. For a small percentage of the investigated
sequences, however, the considered approach fails due to rapid camera movement with 
a high degree of rotation, which typically arises if the UAV is operated manually. This 
leads to errors in the depth estimation which, in turn, result in a misalignment and 
wrong registration in the fusion of the depth maps. Thus, it is important to make use 
of an autonomous flight based on predefined waypoints enforcing a smoother camera 
movement. 


To investigate the ability to use FaSS-MVS for rapid 3D change detection, a pro- 
cessing pipeline is evaluated in which the 3D change is determined by comparison of 
two depth maps, namely a depth map that is estimated from the image data of an in- 
put stream, and a depth map which is synthetically rendered from a reference model 
against which the change is to be detected. The qualitative experiments show that this 
approach is well capable of detecting changes with respect to a 3D model. Yet, it is 
very sensitive to an inaccurate co-registration of the image data and the 3D model as 
well as a difference in the depth scale of the two depth maps, leading to a number of 
false-positive and false-negative detections. Detecting 3D changes based on depth difference,
however, is computationally efficient and is thus particularly suited for rapid
processing.
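
The U- and V-Map representation mentioned in the contribution on obstacle detection above can be sketched as two disparity histograms, one per image column and one per image row. The following Python snippet is a minimal illustration under assumptions and not the implementation used in this work: the function name `u_v_maps`, the integer-valued disparity input, the `max_disp` parameter, and the convention that a disparity of 0 marks an invalid pixel are hypothetical. The subsequent filtering, contour detection and fitting of ellipses and rectangles are not shown.

```python
def u_v_maps(disparity, max_disp):
    """Build U- and V-Maps from an integer disparity map (nested lists).

    u_map[d][x] counts the pixels in image column x with disparity d
    (a top-down view: rows index disparity, columns index image x).
    v_map[y][d] counts the pixels in image row y with disparity d
    (a side view: rows index image y, columns index disparity).
    Assumption: disparities outside (0, max_disp) are ignored as invalid.
    """
    height = len(disparity)
    width = len(disparity[0]) if height else 0
    u_map = [[0] * width for _ in range(max_disp)]
    v_map = [[0] * max_disp for _ in range(height)]
    for y, row in enumerate(disparity):
        for x, d in enumerate(row):
            if 0 < d < max_disp:
                u_map[d][x] += 1
                v_map[y][d] += 1
    return u_map, v_map


if __name__ == "__main__":
    disp = [[0, 3, 3],
            [1, 3, 0]]
    u_map, v_map = u_v_maps(disp, max_disp=4)
    print(u_map)
    print(v_map)
```

An obstacle that faces the camera covers many pixels of similar disparity, so it accumulates into a compact high-count region in both maps, which is what makes the later contour-based extraction inexpensive.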


The focus of this work lies in three different methodical approaches for fast dense depth 
estimation based on image data captured by COTS UAVs. In addition, these approaches are 
embedded into a larger framework focusing on three high-level applications. Consequently,
the future work addresses both methodical aspects with respect to the different approaches
and conceptual aspects with respect to the proposed framework. While some of the future
work that addresses the different methodical aspects is already discussed in the corresponding
chapters, the most important and overlapping aspects are discussed below:


7 Conclusions and Future Work 


• Further optimization with respect to run-time and required hardware resources:
The approaches presented in this work are intended for real-time and online
processing and have been optimized accordingly. While they already reach the desired
performance, thorough evaluation has shown that they still require more optimization
with respect to run-time and the hardware resources needed. ReS²tAC, for example,
does not reach the same throughput as comparable approaches from literature. FaSS-MVS,
on the other hand, requires a large amount of hardware resources, which also
has a negative effect on the run-time. Moreover, the maximum frame size of the approach
for SSDE is limited due to the large batch size that is required for stable training.
Thus, further implementation-specific optimization is required to increase the
performance of the presented approaches for dense depth estimation with respect to
processing speed, which, in turn, increases their benefit.


• Benchmarking and use-case-specific evaluation: A great difficulty in the evaluation
of the approaches presented in this work is the lack of public benchmark datasets
for the task of fast and online depth estimation, which in turn inhibits a direct comparison
to similar approaches from literature. In particular, appropriate datasets for
the task of fast monocular multi-view stereo from aerial imagery, as tackled by FaSS-MVS,
are missing. Furthermore, datasets that focus on the use-cases
of obstacle detection and collision avoidance as well as fast 3D mapping and 3D change
detection are not available. This raises the need for further use-case-specific evaluation and
benchmarking in future work. In this, the acquisition and publishing of appropriate
datasets with high-quality ground truth is essential in order to facilitate benchmarking
with respect to similar approaches from literature.


• Extensive investigation of learning-based approaches with respect to generalization
and practical use: In recent years, the performance of techniques that
rely on deep learning has increased significantly. For the task of depth estimation, 
deep-learning-based approaches stand at the top of the majority of public benchmarks. 
Thus, their use seems very promising with respect to highly accurate and fast process- 
ing. Nonetheless, their performance “in the wild”, meaning outside of carefully crafted 
datasets, is still an open issue and needs further investigation. In particular, investiga- 
tions on whether such approaches are suited for a practical use and how they general- 
ize to unknown scenes and objects are to be conducted. As discussed, the approach for 
self-supervised single-view depth estimation, for example, has certain self-improving 
capabilities, which need further evaluation with respect to more variety and novelty in 
the considered scenes. 


• Complementary use of DIM and SDE approaches: The experiments conducted in
the scope of this work suggest a complementary use of conventional approaches that
rely on DIM as well as learning-based approaches such as the one for self-supervised
single-view depth estimation. This would make it possible to exploit the individual strengths
of the different types of approaches, while at the same time eliminating prominent
weaknesses. The limitation of DIM-based approaches with respect to occluded areas
and dynamic objects, for example, could be compensated by approaches that only rely
on one input image like the one for SSDE presented in this work. At the same time, the
depth maps computed by DIM can help to induce a scale and consistency into the single-view
depth maps. This complementary use, however, is so far only a suggestion
and theoretical discussion, which needs to be implemented and further investigated in
future work.


• Extension of the proposed framework towards the complementary use of
image-based techniques and active sensing methods: Lastly, this work focuses
on depth estimation from image data captured by visual cameras. This is motivated by
the fact that appropriate cameras are available on a majority of COTS UAVs by default
and are significantly cheaper than other sensors, such as LiDAR systems.
However, the use of images from visual cameras is subject to suitable visibility
and lighting conditions which, in turn, depend on the environmental conditions, e.g.
weather and time of day. In recent years, more and more sensors that rely on active
sensing, in particular LiDAR systems, have become available for use and deployment on
UAVs, allowing for a complementary use of image-based techniques and active sensing
methods. Especially the high-level applications considered in the second part of the
proposed framework, i.e. fast, accurate and reliable 3D mapping and 3D change
detection, could greatly benefit from an additional use of active sensing techniques.
Thus, in future work, the proposed framework is to be extended and evaluated to
possibly also integrate data from active sensors, such as LiDAR systems, if available.
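
The complementary use of DIM and SDE suggested in the list above has, as stated there, not been implemented in this work. Purely as a hypothetical illustration of the idea, the following Python sketch rescales a single-view depth map with one global factor derived from the DIM depth map and then fills the pixels that are missing in the DIM result. The function name `fuse_depth`, the choice of the median depth ratio as scale estimate, the convention that depths <= 0 mark missing pixels, and the nested-list representation are all assumptions introduced here.

```python
from statistics import median


def fuse_depth(dim_depth, sde_depth):
    """Fill holes of a DIM depth map with a rescaled single-view depth map.

    Both maps must show the same scene from the same viewpoint and have
    the same size. The scale is the median ratio DIM / SDE over all pixels
    that are valid in both maps (assumption: depth <= 0 means missing).
    Pixels that are missing in both maps stay 0.0.
    """
    ratios = [d / s
              for row_d, row_s in zip(dim_depth, sde_depth)
              for d, s in zip(row_d, row_s)
              if d > 0 and s > 0]
    if not ratios:
        # Without overlap no scale can be estimated; return the DIM map.
        return [list(row) for row in dim_depth]
    scale = median(ratios)
    return [[d if d > 0 else (s * scale if s > 0 else 0.0)
             for d, s in zip(row_d, row_s)]
            for row_d, row_s in zip(dim_depth, sde_depth)]


if __name__ == "__main__":
    dim = [[4.0, 0.0], [8.0, 0.0]]   # 0.0 marks occluded or dynamic regions
    sde = [[1.0, 1.5], [2.0, 0.0]]   # depth up to an unknown scale
    print(fuse_depth(dim, sde))
```

A single global factor is the simplest possible choice; whether it is sufficient, or whether a locally varying or temporally filtered scale would be needed, is exactly the kind of question left to future work.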